CHAPTER 4

INSTRUCTION SET REFERENCE, M-U 


4.1 IMM8 CONTROL BYTE OPERATION FOR PCMPESTRI / PCMPESTRM / 
PCMPISTRI / PCMPISTRM 

The notations introduced in this section are referenced in the reference pages of PCMPESTRI, PCMPESTRM,
PCMPISTRI, and PCMPISTRM. The operation of the immediate control byte is common to these four string text
processing instructions of SSE4.2. This section describes the common operations.


4.1.1 General Description 

The operation of PCMPESTRI, PCMPESTRM, PCMPISTRI, and PCMPISTRM is defined by the combination of the
respective opcode and the interpretation of an immediate control byte that is part of the instruction encoding.

The opcode controls the relationship of input bytes/words to each other (it determines whether the inputs are
terminated strings or whether lengths are expressed explicitly) as well as the desired output (index or mask).

The Imm8 Control Byte for PCMPESTRM/PCMPESTRI/PCMPISTRM/PCMPISTRI encodes a significant amount of
programmable control over the functionality of those instructions. Some functionality is unique to each instruction
while some is common across some or all of the four instructions. This section describes functionality which is
common across the four instructions.

The arithmetic flags (ZF, CF, SF, OF, AF, PF) are set as a result of these instructions. However, the meanings of the
flags have been overloaded from their typical meanings in order to provide additional information regarding the
relationships of the two inputs.

PCMPxSTRx instructions perform arithmetic comparisons between all possible pairs of bytes or words, one from 
each packed input source operand. The boolean results of those comparisons are then aggregated in order to 
produce meaningful results. The Imm8 Control Byte is used to affect the interpretation of individual input elements
as well as control the arithmetic comparisons used and the specific aggregation scheme.

Specifically, the Imm8 Control Byte consists of bit fields that control the following attributes:

• Source data format — Byte/word data element granularity, signed or unsigned elements 

• Aggregation operation — Encodes the mode of per-element comparison operation and the aggregation of 
per-element comparisons into an intermediate result 

• Polarity — Specifies intermediate processing to be performed on the intermediate result

• Output selection — Specifies final operation to produce the output (depending on index or mask) from the
intermediate result


4.1.2 Source Data Format

Table 4-1. Source Data Format

Imm8[1:0]   Meaning          Description
00b         Unsigned bytes   Both 128-bit sources are treated as packed, unsigned bytes.
01b         Unsigned words   Both 128-bit sources are treated as packed, unsigned words.
10b         Signed bytes     Both 128-bit sources are treated as packed, signed bytes.
11b         Signed words     Both 128-bit sources are treated as packed, signed words.


If the Imm8 Control Byte has bit[0] cleared, each source contains 16 packed bytes. If the bit is set, each source
contains 8 packed words. If the Imm8 Control Byte has bit[1] cleared, each source contains unsigned data. If the
bit is set, each source contains signed data.


4.1.3 Aggregation Operation 

Table 4-2. Aggregation Operation

Imm8[3:2]   Mode            Comparison
00b         Equal any       The arithmetic comparison is "equal."
01b         Ranges          Arithmetic comparison is "greater than or equal" between even indexed
                            bytes/words of reg and each byte/word of reg/mem.
                            Arithmetic comparison is "less than or equal" between odd indexed
                            bytes/words of reg and each byte/word of reg/mem.
                            (reg/mem[m] >= reg[n] for n = even, reg/mem[m] <= reg[n] for n = odd)
10b         Equal each      The arithmetic comparison is "equal."
11b         Equal ordered   The arithmetic comparison is "equal."

All 256 (64) possible comparisons are always performed. The individual Boolean results of those comparisons are
referred to as "BoolRes[Reg/Mem element index, Reg element index]." Comparisons evaluating to "True" are
represented with a 1, False with a 0 (positive logic). The initial results are then aggregated into a 16-bit (8-bit)
intermediate result (IntRes1) using one of the modes described in Table 4-3, as determined by Imm8 Control Byte
bit[3:2].



See Section 4.1.6 for a description of the overrideIfDataInvalid() function used in Table 4-3. 

Table 4-3. Aggregation Operation

Mode: Equal any (find characters from a set)

    UpperBound = imm8[0] ? 7 : 15;
    IntRes1 = 0;
    For j = 0 to UpperBound, j++
        For i = 0 to UpperBound, i++
            IntRes1[j] OR= overrideIfDataInvalid(BoolRes[j,i])

Mode: Ranges (find characters from ranges)

    UpperBound = imm8[0] ? 7 : 15;
    IntRes1 = 0;
    For j = 0 to UpperBound, j++
        For i = 0 to UpperBound, i+=2
            IntRes1[j] OR= (overrideIfDataInvalid(BoolRes[j,i]) AND
                            overrideIfDataInvalid(BoolRes[j,i+1]))

Mode: Equal each (string compare)

    UpperBound = imm8[0] ? 7 : 15;
    IntRes1 = 0;
    For i = 0 to UpperBound, i++
        IntRes1[i] = overrideIfDataInvalid(BoolRes[i,i])

Mode: Equal ordered (substring search)

    UpperBound = imm8[0] ? 7 : 15;
    IntRes1 = imm8[0] ? FFH : FFFFH
    For j = 0 to UpperBound, j++
        For i = 0 to UpperBound-j, k=j to UpperBound, k++, i++
            IntRes1[j] AND= overrideIfDataInvalid(BoolRes[k,i])
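
The four aggregation modes of Table 4-3 can be modelled in scalar C. The following is a minimal sketch, not the hardware implementation: the names aggregate, agg_mode, and bool_res are hypothetical, and bool_res[j][i] is assumed to already hold the overridden comparison result for reg/mem element j against reg element i.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the enumerator values follow Imm8[3:2]. */
enum agg_mode { EQUAL_ANY = 0, RANGES = 1, EQUAL_EACH = 2, EQUAL_ORDERED = 3 };

/* bool_res[j][i]: overridden comparison of reg/mem element j with reg element i.
   words: true for 8 packed words (Imm8[0] = 1), false for 16 packed bytes. */
uint16_t aggregate(enum agg_mode mode, bool words, bool bool_res[16][16])
{
    int upper = words ? 7 : 15;
    uint16_t int_res1 = 0;

    switch (mode) {
    case EQUAL_ANY:      /* bit j set if element j matches any set member */
        for (int j = 0; j <= upper; j++)
            for (int i = 0; i <= upper; i++)
                if (bool_res[j][i])
                    int_res1 = (uint16_t)(int_res1 | (1u << j));
        break;
    case RANGES:         /* bit j set if element j lies in any (low, high) pair */
        for (int j = 0; j <= upper; j++)
            for (int i = 0; i <= upper; i += 2)
                if (bool_res[j][i] && bool_res[j][i + 1])
                    int_res1 = (uint16_t)(int_res1 | (1u << j));
        break;
    case EQUAL_EACH:     /* bit i set if the two elements at index i match */
        for (int i = 0; i <= upper; i++)
            if (bool_res[i][i])
                int_res1 = (uint16_t)(int_res1 | (1u << i));
        break;
    case EQUAL_ORDERED:  /* bit j stays set if the substring matches at offset j */
        int_res1 = words ? 0x00FF : 0xFFFF;
        for (int j = 0; j <= upper; j++)
            for (int i = 0, k = j; i <= upper - j; i++, k++)
                if (!bool_res[k][i])
                    int_res1 = (uint16_t)(int_res1 & ~(1u << j));
        break;
    }
    return int_res1;
}
```

With only bool_res[3][0] set, "equal any" reports bit 3, while "equal each" reports nothing because it inspects only the diagonal.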


4.1.4 Polarity 

IntRes1 may then be further modified by performing a 1's complement, according to the value of the Imm8 Control
Byte bit[4]. Optionally, a mask may be used such that only those IntRes1 bits which correspond to "valid" reg/mem
input elements are complemented (note that the definition of a valid input element is dependent on the specific
opcode and is defined in each opcode's description). The result of the possible negation is referred to as IntRes2.


Table 4-4. Polarity

Imm8[5:4]   Operation               Description
00b         Positive Polarity (+)   IntRes2 = IntRes1
01b         Negative Polarity (-)   IntRes2 = -1 XOR IntRes1
10b         Masked (+)              IntRes2 = IntRes1
11b         Masked (-)              IntRes2[i] = IntRes1[i] if reg/mem[i] invalid, else = ~IntRes1[i]
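
The polarity step can be sketched in C as follows. This is an illustrative model under stated assumptions: apply_polarity and valid_mask are hypothetical names, valid_mask is assumed to have bit i set when reg/mem element i is valid, and the result is truncated to 8 bits in word mode.

```c
#include <stdint.h>

/* imm8: the control byte; int_res1: aggregated result;
   valid_mask: bit i set when reg/mem element i is valid (assumed input). */
uint16_t apply_polarity(uint8_t imm8, uint16_t int_res1, uint16_t valid_mask)
{
    /* IntRes1 is 8 bits wide for words, 16 bits wide for bytes. */
    uint16_t width_mask = (imm8 & 0x01) ? 0x00FF : 0xFFFF;

    switch ((imm8 >> 4) & 0x03) {
    case 1:   /* 01b: negative polarity, complement every bit */
        return (uint16_t)((int_res1 ^ 0xFFFF) & width_mask);
    case 3:   /* 11b: masked negative, complement only valid positions */
        return (uint16_t)((int_res1 ^ valid_mask) & width_mask);
    default:  /* 00b and 10b: IntRes2 = IntRes1 */
        return (uint16_t)(int_res1 & width_mask);
    }
}
```

XOR with valid_mask flips exactly the bits whose elements are valid and leaves the others untouched, which matches the per-bit rule in the last table row.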


4.1.5 Output Selection

Table 4-5. Output Selection

Imm8[6]   Operation                 Description
0b        Least significant index   The index returned to ECX is of the least significant set bit in IntRes2.
1b        Most significant index    The index returned to ECX is of the most significant set bit in IntRes2.

For PCMPESTRI/PCMPISTRI, the Imm8 Control Byte bit[6] is used to determine if the index is of the least significant
or most significant bit of IntRes2.


Table 4-6. Output Selection

Imm8[6]   Operation        Description
0b        Bit mask         IntRes2 is returned as the mask to the least significant bits of XMM0 with
                           zero extension to 128 bits.
1b        Byte/word mask   IntRes2 is expanded into a byte/word mask (based on imm8[1]) and placed in
                           XMM0. The expansion is performed by replicating each bit into all of the bits
                           of the byte/word of the same index.

Specifically for PCMPESTRM/PCMPISTRM, the Imm8 Control Byte bit[6] is used to determine if the mask is a 16 (8)
bit mask or a 128 bit byte/word mask.
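
The two output forms can be sketched in C as below. This is a model under several assumptions, not a definitive implementation: the function names are hypothetical, the element width is taken from Imm8[0] as in Section 4.1.2, XMM0 is modelled as 16 bytes with the least significant byte first, and the index form returns the element count (16 or 8) when no bit is set, which is the behavior given in the individual instruction pages rather than in this section.

```c
#include <stdint.h>
#include <string.h>

/* Index form (PCMPESTRI/PCMPISTRI): value that would be returned in ECX. */
int select_index(uint8_t imm8, uint16_t int_res2)
{
    int n = (imm8 & 0x01) ? 8 : 16;          /* number of elements */
    if (imm8 & 0x40) {                       /* most significant set bit */
        for (int i = n - 1; i >= 0; i--)
            if (int_res2 & (1u << i))
                return i;
    } else {                                 /* least significant set bit */
        for (int i = 0; i < n; i++)
            if (int_res2 & (1u << i))
                return i;
    }
    return n;                                /* assumed: no bit set */
}

/* Mask form (PCMPESTRM/PCMPISTRM): out models XMM0, LSB first. */
void select_mask(uint8_t imm8, uint16_t int_res2, uint8_t out[16])
{
    int words = imm8 & 0x01;
    memset(out, 0, 16);
    if (!(imm8 & 0x40)) {                    /* bit mask, zero-extended */
        out[0] = (uint8_t)(int_res2 & 0xFF);
        out[1] = words ? 0 : (uint8_t)(int_res2 >> 8);
        return;
    }
    int n = words ? 8 : 16;
    size_t width = words ? 2 : 1;            /* bytes per element */
    for (int i = 0; i < n; i++)              /* replicate bit i across element i */
        if (int_res2 & (1u << i))
            memset(out + (size_t)i * width, 0xFF, width);
}
```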


4.1.6 Valid/Invalid Override of Comparisons 

PCMPxSTRx instructions allow for the possibility that an end-of-string (EOS) situation may occur within the 128-bit 
packed data value (see the instruction descriptions below for details). Any data elements on either source that are 
determined to be past the EOS are considered to be invalid, and the treatment of invalid data within a comparison 
pair varies depending on the aggregation function being performed. 

In general, the individual comparison result for each element pair BoolRes[i,j] can be forced true or false if one or
more elements in the pair are invalid. See Table 4-7.


Table 4-7. Comparison Result for Each Element Pair BoolRes[i,j]

xmm1        xmm2/m128   Imm8[3:2] = 00b   Imm8[3:2] = 01b   Imm8[3:2] = 10b   Imm8[3:2] = 11b
byte/word   byte/word   (equal any)       (ranges)          (equal each)      (equal ordered)
Invalid     Invalid     Force false       Force false       Force true        Force true
Invalid     Valid       Force false       Force false       Force false       Force true
Valid       Invalid     Force false       Force false       Force false       Force false
Valid       Valid       Do not force      Do not force      Do not force      Do not force
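
Table 4-7 reduces to a small decision function. The sketch below is one way to write it in C; the name override_if_data_invalid mirrors the overrideIfDataInvalid() function named in this section, but its signature and parameters here are assumptions made for illustration.

```c
#include <stdbool.h>

/* mode: Imm8[3:2]; reg_valid: xmm1 element validity;
   mem_valid: xmm2/m128 element validity; raw: the unforced comparison result. */
bool override_if_data_invalid(int mode, bool reg_valid, bool mem_valid, bool raw)
{
    if (reg_valid && mem_valid)
        return raw;                          /* do not force */

    switch (mode & 0x03) {
    case 2:                                  /* equal each */
        return !reg_valid && !mem_valid;     /* true only if both are invalid */
    case 3:                                  /* equal ordered */
        return !reg_valid;                   /* true whenever xmm1 element is invalid */
    default:                                 /* equal any, ranges */
        return false;
    }
}
```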


4.1.7 Summary of Imm8 Control Byte

Table 4-8. Summary of Imm8 Control Byte

Imm8        Description
-------0b   128-bit sources treated as 16 packed bytes.
-------1b   128-bit sources treated as 8 packed words.
------0-b   Packed bytes/words are unsigned.
------1-b   Packed bytes/words are signed.
----00--b   Mode is equal any.
----01--b   Mode is ranges.
----10--b   Mode is equal each.
----11--b   Mode is equal ordered.
---0----b   IntRes1 is unmodified.
---1----b   IntRes1 is negated (1's complement).
--0-----b   Negation of IntRes1 is for all 16 (8) bits.
--1-----b   Negation of IntRes1 is masked by reg/mem validity.
-0------b   Index of the least significant, set, bit is used (regardless of corresponding input element
            validity). IntRes2 is returned in least significant bits of XMM0.
-1------b   Index of the most significant, set, bit is used (regardless of corresponding input element
            validity). Each bit of IntRes2 is expanded to byte/word.
0-------b   This bit currently has no defined effect, should be 0.
1-------b   This bit currently has no defined effect, should be 0.
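
The bit fields summarized in Table 4-8 can be unpacked with a few shifts and masks. The struct and function names below are hypothetical and serve only to illustrate the layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of the Imm8 control byte. */
struct imm8_fields {
    bool words;          /* bit 0: 0 = 16 packed bytes, 1 = 8 packed words */
    bool is_signed;      /* bit 1: 0 = unsigned, 1 = signed */
    unsigned mode;       /* bits 3:2: 0 any, 1 ranges, 2 each, 3 ordered */
    bool negate;         /* bit 4: IntRes1 is complemented */
    bool masked;         /* bit 5: negation masked by reg/mem validity */
    bool msb_or_expand;  /* bit 6: MSB index (xSTRI) or byte/word mask (xSTRM) */
};

struct imm8_fields decode_imm8(uint8_t imm8)
{
    struct imm8_fields f;
    f.words         = (imm8 & 0x01) != 0;
    f.is_signed     = (imm8 & 0x02) != 0;
    f.mode          = (imm8 >> 2) & 0x03u;
    f.negate        = (imm8 & 0x10) != 0;
    f.masked        = (imm8 & 0x20) != 0;
    f.msb_or_expand = (imm8 & 0x40) != 0;
    return f;            /* bit 7 has no defined effect and is ignored */
}
```

For example, 0x4D (01001101b) decodes to unsigned words, equal ordered, positive polarity, with bit 6 set.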


4.1.8 Diagram Comparison and Aggregation Process

[Figure not reproduced. Surviving labels: "PCMPxSTRI only", "PCMPxSTRM only".]

Figure 4-1. Operation of PCMPSTRx and PCMPESTRx


4.2 COMMON TRANSFORMATION AND PRIMITIVE FUNCTIONS FOR SHA1XXX 
AND SHA256XXX 

The following primitive functions and transformations are used in the algorithmic descriptions of SHA1 and SHA256
instruction extensions SHA1NEXTE, SHA1RNDS4, SHA1MSG1, SHA1MSG2, SHA256RNDS4, SHA256MSG1 and
SHA256MSG2. The operands of these primitives and transformations are generally 32-bit DWORD integers.

• f0(): A bit oriented logical operation that derives a new dword from three SHA1 state variables (dword). This
function is used in SHA1 round 1 to 20 processing.

f0(B,C,D) ← (B AND C) XOR ((NOT(B)) AND D)

• f1(): A bit oriented logical operation that derives a new dword from three SHA1 state variables (dword). This
function is used in SHA1 round 21 to 40 processing.

f1(B,C,D) ← B XOR C XOR D

• f2(): A bit oriented logical operation that derives a new dword from three SHA1 state variables (dword). This
function is used in SHA1 round 41 to 60 processing.

f2(B,C,D) ← (B AND C) XOR (B AND D) XOR (C AND D)

• f3(): A bit oriented logical operation that derives a new dword from three SHA1 state variables (dword). This
function is used in SHA1 round 61 to 80 processing. It is the same as f1().

f3(B,C,D) ← B XOR C XOR D

• Ch(): A bit oriented logical operation that derives a new dword from three SHA256 state variables (dword).

Ch(E,F,G) ← (E AND F) XOR ((NOT E) AND G)

• Maj(): A bit oriented logical operation that derives a new dword from three SHA256 state variables (dword).

Maj(A,B,C) ← (A AND B) XOR (A AND C) XOR (B AND C)

• ROR is rotate right operation

(A ROR N) ← A[N-1:0] || A[Width-1:N]

• ROL is rotate left operation

(A ROL N) ← A ROR (Width-N)

• SHR is the right shift operation

(A SHR N) ← ZEROES[N-1:0] || A[Width-1:N]

• Σ0( ): A bit oriented logical and rotational transformation performed on a dword SHA256 state variable.

Σ0(A) ← (A ROR 2) XOR (A ROR 13) XOR (A ROR 22)

• Σ1( ): A bit oriented logical and rotational transformation performed on a dword SHA256 state variable.

Σ1(E) ← (E ROR 6) XOR (E ROR 11) XOR (E ROR 25)

• σ0( ): A bit oriented logical and rotational transformation performed on a SHA256 message dword used in the
message scheduling.

σ0(W) ← (W ROR 7) XOR (W ROR 18) XOR (W SHR 3)

• σ1( ): A bit oriented logical and rotational transformation performed on a SHA256 message dword used in the
message scheduling.

σ1(W) ← (W ROR 17) XOR (W ROR 19) XOR (W SHR 10)

• Ki: SHA1 Constants dependent on immediate i.

K0 = 0x5A827999
K1 = 0x6ED9EBA1
K2 = 0x8F1BBCDC
K3 = 0xCA62C1D6
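
The primitives above translate almost directly into C on 32-bit unsigned integers. The function names in this sketch (ror32, sha1_f0, big_sigma0, and so on) are hypothetical; only the formulas and constants come from the definitions in this section.

```c
#include <stdint.h>

/* Rotates on 32-bit dwords; the n == 0 case avoids an undefined 32-bit shift. */
static inline uint32_t ror32(uint32_t a, unsigned n)
{
    n &= 31u;
    return n ? (a >> n) | (a << (32u - n)) : a;
}
static inline uint32_t rol32(uint32_t a, unsigned n) { return ror32(a, 32u - (n & 31u)); }

/* SHA1 round functions. */
static inline uint32_t sha1_f0(uint32_t b, uint32_t c, uint32_t d) { return (b & c) ^ (~b & d); }
static inline uint32_t sha1_f1(uint32_t b, uint32_t c, uint32_t d) { return b ^ c ^ d; }
static inline uint32_t sha1_f2(uint32_t b, uint32_t c, uint32_t d) { return (b & c) ^ (b & d) ^ (c & d); }
static inline uint32_t sha1_f3(uint32_t b, uint32_t c, uint32_t d) { return sha1_f1(b, c, d); }

/* SHA256 state functions. */
static inline uint32_t sha256_ch(uint32_t e, uint32_t f, uint32_t g)  { return (e & f) ^ (~e & g); }
static inline uint32_t sha256_maj(uint32_t a, uint32_t b, uint32_t c) { return (a & b) ^ (a & c) ^ (b & c); }

/* SHA256 transformations: upper-case sigma on state, lower-case on message dwords. */
static inline uint32_t big_sigma0(uint32_t a)   { return ror32(a, 2) ^ ror32(a, 13) ^ ror32(a, 22); }
static inline uint32_t big_sigma1(uint32_t e)   { return ror32(e, 6) ^ ror32(e, 11) ^ ror32(e, 25); }
static inline uint32_t small_sigma0(uint32_t w) { return ror32(w, 7) ^ ror32(w, 18) ^ (w >> 3); }
static inline uint32_t small_sigma1(uint32_t w) { return ror32(w, 17) ^ ror32(w, 19) ^ (w >> 10); }

/* SHA1 constants K0..K3, indexed by the immediate i. */
static const uint32_t sha1_k[4] = { 0x5A827999u, 0x6ED9EBA1u, 0x8F1BBCDCu, 0xCA62C1D6u };
```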

4.3 INSTRUCTIONS (M-U) 

Chapter 4 continues an alphabetical discussion of Intel® 64 and IA-32 instructions (M-U). See also: Chapter 3,
"Instruction Set Reference, A-L," in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume
2A, and Chapter 5, "Instruction Set Reference, V-Z," in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2C.


MASKMOVDQU—Store Selected Bytes of Double Quadword

Opcode/                    Op/   64/32-bit   CPUID          Description
Instruction                En    Mode        Feature Flag

66 0F F7 /r                RM    V/V         SSE2           Selectively write bytes from xmm1 to memory
MASKMOVDQU xmm1, xmm2                                       location using the byte mask in xmm2. The default
                                                            memory location is specified by DS:DI/EDI/RDI.

VEX.128.66.0F.WIG F7 /r    RM    V/V         AVX            Selectively write bytes from xmm1 to memory
VMASKMOVDQU xmm1, xmm2                                      location using the byte mask in xmm2. The default
                                                            memory location is specified by DS:DI/EDI/RDI.


Instruction Operand Encoding¹

Op/En   Operand 1       Operand 2       Operand 3   Operand 4
RM      ModRM:reg (r)   ModRM:r/m (r)   NA          NA

Description 

Stores selected bytes from the source operand (first operand) into a 128-bit memory location. The mask operand
(second operand) selects which bytes from the source operand are written to memory. The source and mask
operands are XMM registers. The memory location is specified by the effective address in the DI/EDI/RDI register
(the default segment register is DS, but this may be overridden with a segment-override prefix). The memory
location does not need to be aligned on a natural boundary. (The size of the store address depends on the
address-size attribute.)

The most significant bit in each byte of the mask operand determines whether the corresponding byte in the source
operand is written to the corresponding byte location in memory: 0 indicates no write and 1 indicates write.

The MASKMOVDQU instruction generates a non-temporal hint to the processor to minimize cache pollution. The
non-temporal hint is implemented by using a write combining (WC) memory type protocol (see "Caching of
Temporal vs. Non-Temporal Data" in Chapter 10, of the Intel® 64 and IA-32 Architectures Software Developer's
Manual, Volume 1). Because the WC protocol uses a weakly-ordered memory consistency model, a fencing
operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MASKMOVDQU
instructions if multiple processors might use different memory types to read/write the destination memory
locations.

Behavior with a mask of all 0s is as follows:

• No data will be written to memory. 

• Signaling of breakpoints (code or data) is not guaranteed; different processor implementations may signal or 
not signal these breakpoints. 

• Exceptions associated with addressing memory and page faults may still be signaled (implementation 
dependent). 

• If the destination memory region is mapped as UC or WP, enforcement of associated semantics for these 
memory types is not guaranteed (that is, is reserved) and is implementation-specific. 

The MASKMOVDQU instruction can be used to improve performance of algorithms that need to merge data on a
byte-by-byte basis. MASKMOVDQU should not cause a read for ownership; doing so generates unnecessary
bandwidth since data is to be written directly using the byte-mask without allocating old data prior to the store.

In 64-bit mode, use of the REX.R prefix permits this instruction to access additional registers (XMM8-XMM15).

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b; otherwise the instruction will #UD.

If VMASKMOVDQU is encoded with VEX.L = 1, an attempt to execute the instruction will cause a #UD exception.


1. ModRM.MOD = 011B required


Operation

IF (MASK[7] = 1)
    THEN DEST[DI/EDI] ← SRC[7:0] ELSE (* Memory location unchanged *); FI;
IF (MASK[15] = 1)
    THEN DEST[DI/EDI +1] ← SRC[15:8] ELSE (* Memory location unchanged *); FI;
    (* Repeat operation for 3rd through 15th bytes in source operand *)
IF (MASK[127] = 1)
    THEN DEST[DI/EDI +15] ← SRC[127:120] ELSE (* Memory location unchanged *); FI;
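
The per-byte rule in the Operation pseudocode can be modelled with a short scalar loop. This sketch only illustrates which bytes are written; the name maskmov_model and its parameters are hypothetical, and it does not model the non-temporal hint, memory ordering, or fault behavior described above.

```c
#include <stddef.h>
#include <stdint.h>

/* dest: destination memory; src: source operand bytes; mask: mask operand bytes.
   nbytes is 16 for MASKMOVDQU (8 would model MASKMOVQ). */
void maskmov_model(uint8_t *dest, const uint8_t *src, const uint8_t *mask, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++) {
        /* Only the most significant bit of each mask byte is examined. */
        if (mask[i] & 0x80)
            dest[i] = src[i];
        /* Otherwise the memory location is left unchanged. */
    }
}
```

A mask byte of 0x7F therefore selects nothing, while 0x80 and 0xFF both select the byte.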

Intel C/C++ Compiler Intrinsic Equivalent 

void _mm_maskmoveu_si128(__m128i d, __m128i n, char * p)

Other Exceptions 

See Exceptions Type 4; additionally
#UD     If VEX.L = 1.
        If VEX.vvvv ≠ 1111B.




MASKMOVQ—Store Selected Bytes of Quadword

Opcode/               Op/   64-Bit   Compat/    Description
Instruction           En    Mode     Leg Mode

0F F7 /r              RM    Valid    Valid      Selectively write bytes from mm1 to memory location
MASKMOVQ mm1, mm2                               using the byte mask in mm2. The default memory
                                                location is specified by DS:DI/EDI/RDI.


Instruction Operand Encoding

Op/En   Operand 1       Operand 2       Operand 3   Operand 4
RM      ModRM:reg (r)   ModRM:r/m (r)   NA          NA


Description 

Stores selected bytes from the source operand (first operand) into a 64-bit memory location. The mask operand
(second operand) selects which bytes from the source operand are written to memory. The source and mask
operands are MMX technology registers. The memory location is specified by the effective address in the
DI/EDI/RDI register (the default segment register is DS, but this may be overridden with a segment-override
prefix). The memory location does not need to be aligned on a natural boundary. (The size of the store address
depends on the address-size attribute.)

The most significant bit in each byte of the mask operand determines whether the corresponding byte in the source 
operand is written to the corresponding byte location in memory: 0 indicates no write and 1 indicates write. 

The MASKMOVQ instruction generates a non-temporal hint to the processor to minimize cache pollution. The
non-temporal hint is implemented by using a write combining (WC) memory type protocol (see "Caching of Temporal
vs. Non-Temporal Data" in Chapter 10, of the Intel® 64 and IA-32 Architectures Software Developer's Manual,
Volume 1). Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation
implemented with the SFENCE or MFENCE instruction should be used in conjunction with MASKMOVQ instructions if
multiple processors might use different memory types to read/write the destination memory locations.

This instruction causes a transition from x87 FPU to MMX technology state (that is, the x87 FPU top-of-stack pointer
is set to 0 and the x87 FPU tag word is set to all 0s [valid]).

The behavior of the MASKMOVQ instruction with a mask of all 0s is as follows:

• No data will be written to memory. 

• Transition from x87 FPU to MMX technology state will occur. 

• Exceptions associated with addressing memory and page faults may still be signaled (implementation 
dependent). 

• Signaling of breakpoints (code or data) is not guaranteed (implementation dependent). 

• If the destination memory region is mapped as UC or WP, enforcement of associated semantics for these 
memory types is not guaranteed (that is, is reserved) and is implementation-specific. 

The MASKMOVQ instruction can be used to improve performance for algorithms that need to merge data on a
byte-by-byte basis. It should not cause a read for ownership; doing so generates unnecessary bandwidth since data
is to be written directly using the byte-mask without allocating old data prior to the store.

In 64-bit mode, the memory address is specified by DS:RDI. 


Operation

IF (MASK[7] = 1)
    THEN DEST[DI/EDI] ← SRC[7:0] ELSE (* Memory location unchanged *); FI;
IF (MASK[15] = 1)
    THEN DEST[DI/EDI +1] ← SRC[15:8] ELSE (* Memory location unchanged *); FI;
    (* Repeat operation for 3rd through 7th bytes in source operand *)
IF (MASK[63] = 1)
    THEN DEST[DI/EDI +7] ← SRC[63:56] ELSE (* Memory location unchanged *); FI;

Intel C/C++ Compiler Intrinsic Equivalent 

void _mm_maskmove_si64(__m64 d, __m64 n, char * p)

Other Exceptions 

See Table 22-8, "Exception Conditions for Legacy SIMD/MMX Instructions without FP Exception," in the Intel® 64
and IA-32 Architectures Software Developer's Manual, Volume 3A.



MAXPD—Maximum of Packed Double-Precision Floating-Point Values 


Opcode/                              Op/   64/32      CPUID      Description
Instruction                          En    bit Mode   Feature
                                           Support    Flag

66 0F 5F /r                          RM    V/V        SSE2       Return the maximum double-precision floating-point
MAXPD xmm1, xmm2/m128                                            values between xmm1 and xmm2/m128.

VEX.NDS.128.66.0F.WIG 5F /r          RVM   V/V        AVX        Return the maximum double-precision floating-point
VMAXPD xmm1, xmm2, xmm3/m128                                     values between xmm2 and xmm3/m128.

VEX.NDS.256.66.0F.WIG 5F /r          RVM   V/V        AVX        Return the maximum packed double-precision
VMAXPD ymm1, ymm2, ymm3/m256                                     floating-point values between ymm2 and ymm3/m256.

EVEX.NDS.128.66.0F.W1 5F /r          FV    V/V        AVX512VL   Return the maximum packed double-precision
VMAXPD xmm1 {k1}{z}, xmm2,                            AVX512F    floating-point values between xmm2 and
xmm3/m128/m64bcst                                                xmm3/m128/m64bcst and store result in xmm1
                                                                 subject to writemask k1.

EVEX.NDS.256.66.0F.W1 5F /r          FV    V/V        AVX512VL   Return the maximum packed double-precision
VMAXPD ymm1 {k1}{z}, ymm2,                            AVX512F    floating-point values between ymm2 and
ymm3/m256/m64bcst                                                ymm3/m256/m64bcst and store result in ymm1
                                                                 subject to writemask k1.

EVEX.NDS.512.66.0F.W1 5F /r          FV    V/V        AVX512F    Return the maximum packed double-precision
VMAXPD zmm1 {k1}{z}, zmm2,                                       floating-point values between zmm2 and
zmm3/m512/m64bcst{sae}                                           zmm3/m512/m64bcst and store result in zmm1
                                                                 subject to writemask k1.


Instruction Operand Encoding

Op/En   Operand 1          Operand 2       Operand 3       Operand 4
RM      ModRM:reg (r, w)   ModRM:r/m (r)   NA              NA
RVM     ModRM:reg (w)      VEX.vvvv        ModRM:r/m (r)   NA
FV      ModRM:reg (w)      EVEX.vvvv       ModRM:r/m (r)   NA


Description 

Performs a SIMD compare of the packed double-precision floating-point values in the first source operand and the 
second source operand and returns the maximum value for each pair of values to the destination operand. 

If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is 
returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that 
is, a QNaN version of the SNaN is not returned). 

If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN 
or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source 
operand (from either the first or second operand) be returned, the action of MAXPD can be emulated using a 
sequence of instructions, such as a comparison followed by AND, ANDN and OR. 

EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAX_VL-1:256)
of the corresponding ZMM register destination are zeroed.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128)
of the corresponding ZMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The
destination is not distinct from the first source XMM register and the upper bits (MAX_VL-1:128) of the
corresponding ZMM register destination are unmodified.

Operation

MAX(SRC1, SRC2)
{
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
        ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
        ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
        ELSE IF (SRC1 > SRC2) THEN DEST ← SRC1;
        ELSE DEST ← SRC2;
    FI;
}
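
The per-element MAX() rule can be modelled for one double in C. This is a sketch under an assumption: standard C does not distinguish SNaN from QNaN portably, so any NaN in either operand is treated the way the Description treats it, by returning the second operand. The name max_model is hypothetical.

```c
#include <math.h>

/* Scalar model of MAX(SRC1, SRC2) for one double-precision element. */
double max_model(double src1, double src2)
{
    /* +0.0 and -0.0 compare equal to 0.0, so this covers zeros of either sign. */
    if (src1 == 0.0 && src2 == 0.0)
        return src2;
    /* Assumed simplification: any NaN selects the second operand. */
    if (isnan(src1) || isnan(src2))
        return src2;
    return (src1 > src2) ? src1 : src2;
}
```

Because the second operand wins for zeros and for NaNs, the operation is not commutative: max_model(0.0, -0.0) returns -0.0, and max_model(NAN, 1.0) returns 1.0.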

VMAXPD (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN
                    DEST[i+63:i] ← MAX(SRC1[i+63:i], SRC2[63:0])
                ELSE
                    DEST[i+63:i] ← MAX(SRC1[i+63:i], SRC2[i+63:i])
            FI;
        ELSE
            IF *merging-masking*                    ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0               ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
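
The writemask and broadcast handling in the EVEX pseudocode can be sketched as a scalar loop over elements. The function name and parameters below are hypothetical; the "no writemask" case is modelled by passing an all-ones k1, the per-element maximum is simplified to a plain comparison, and the zeroing of bits above VL is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified element maximum; returns the second operand when a > b is false. */
static double max_elem(double a, double b) { return (a > b) ? a : b; }

/* kl: number of 64-bit elements (2, 4, or 8); k1: writemask bits;
   zeroing: true for zeroing-masking, false for merging-masking;
   broadcast: true when SRC2 is a broadcast memory element (EVEX.b = 1). */
void vmaxpd_evex_model(double *dest, const double *src1, const double *src2,
                       int kl, uint8_t k1, bool zeroing, bool broadcast)
{
    for (int j = 0; j < kl; j++) {
        if (k1 & (1u << j)) {
            double b = broadcast ? src2[0] : src2[j];
            dest[j] = max_elem(src1[j], b);
        } else if (zeroing) {
            dest[j] = 0.0;           /* zeroing-masking */
        }
        /* merging-masking: dest[j] remains unchanged */
    }
}
```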


VMAXPD (VEX.256 encoded version)

DEST[63:0] ← MAX(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← MAX(SRC1[127:64], SRC2[127:64])
DEST[191:128] ← MAX(SRC1[191:128], SRC2[191:128])
DEST[255:192] ← MAX(SRC1[255:192], SRC2[255:192])
DEST[MAX_VL-1:256] ← 0


VMAXPD (VEX.128 encoded version)

DEST[63:0] ← MAX(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← MAX(SRC1[127:64], SRC2[127:64])
DEST[MAX_VL-1:128] ← 0


MAXPD (128-bit Legacy SSE version)

DEST[63:0] ← MAX(DEST[63:0], SRC[63:0])
DEST[127:64] ← MAX(DEST[127:64], SRC[127:64])
DEST[MAX_VL-1:128] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent 

VMAXPD __m512d _mm512_max_pd(__m512d a, __m512d b);
VMAXPD __m512d _mm512_mask_max_pd(__m512d s, __mmask8 k, __m512d a, __m512d b);
VMAXPD __m512d _mm512_maskz_max_pd(__mmask8 k, __m512d a, __m512d b);
VMAXPD __m512d _mm512_max_round_pd(__m512d a, __m512d b, int);
VMAXPD __m512d _mm512_mask_max_round_pd(__m512d s, __mmask8 k, __m512d a, __m512d b, int);
VMAXPD __m512d _mm512_maskz_max_round_pd(__mmask8 k, __m512d a, __m512d b, int);
VMAXPD __m256d _mm256_mask_max_pd(__m256d s, __mmask8 k, __m256d a, __m256d b);
VMAXPD __m256d _mm256_maskz_max_pd(__mmask8 k, __m256d a, __m256d b);
VMAXPD __m128d _mm_mask_max_pd(__m128d s, __mmask8 k, __m128d a, __m128d b);
VMAXPD __m128d _mm_maskz_max_pd(__mmask8 k, __m128d a, __m128d b);
VMAXPD __m256d _mm256_max_pd(__m256d a, __m256d b);
(V)MAXPD __m128d _mm_max_pd(__m128d a, __m128d b);


SIMD Floating-Point Exceptions 

Invalid (including QNaN Source Operand), Denormal 


Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 2. 
EVEX-encoded instruction, see Exceptions Type E2. 


MAXPS—Maximum of Packed Single-Precision Floating-Point Values

Opcode/                              Op/   64/32      CPUID      Description
Instruction                          En    bit Mode   Feature
                                           Support    Flag

0F 5F /r                             RM    V/V        SSE        Return the maximum single-precision floating-point
MAXPS xmm1, xmm2/m128                                            values between xmm1 and xmm2/mem.

VEX.NDS.128.0F.WIG 5F /r             RVM   V/V        AVX        Return the maximum single-precision floating-point
VMAXPS xmm1, xmm2, xmm3/m128                                     values between xmm2 and xmm3/mem.

VEX.NDS.256.0F.WIG 5F /r             RVM   V/V        AVX        Return the maximum single-precision floating-point
VMAXPS ymm1, ymm2, ymm3/m256                                     values between ymm2 and ymm3/mem.

EVEX.NDS.128.0F.W0 5F /r             FV    V/V        AVX512VL   Return the maximum packed single-precision
VMAXPS xmm1 {k1}{z}, xmm2,                            AVX512F    floating-point values between xmm2 and
xmm3/m128/m32bcst                                                xmm3/m128/m32bcst and store result in xmm1
                                                                 subject to writemask k1.

EVEX.NDS.256.0F.W0 5F /r             FV    V/V        AVX512VL   Return the maximum packed single-precision
VMAXPS ymm1 {k1}{z}, ymm2,                            AVX512F    floating-point values between ymm2 and
ymm3/m256/m32bcst                                                ymm3/m256/m32bcst and store result in ymm1
                                                                 subject to writemask k1.

EVEX.NDS.512.0F.W0 5F /r             FV    V/V        AVX512F    Return the maximum packed single-precision
VMAXPS zmm1 {k1}{z}, zmm2,                                       floating-point values between zmm2 and
zmm3/m512/m32bcst{sae}                                           zmm3/m512/m32bcst and store result in zmm1
                                                                 subject to writemask k1.


Instruction Operand Encoding

Op/En   Operand 1          Operand 2       Operand 3       Operand 4
RM      ModRM:reg (r, w)   ModRM:r/m (r)   NA              NA
RVM     ModRM:reg (w)      VEX.vvvv        ModRM:r/m (r)   NA
FV      ModRM:reg (w)      EVEX.vvvv       ModRM:r/m (r)   NA


Description 

Performs a SIMD compare of the packed single-precision floating-point values in the first source operand and the 
second source operand and returns the maximum value for each pair of values to the destination operand. 

If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is 
returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that 
is, a QNaN version of the SNaN is not returned). 

If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN
or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source
operand (from either the first or second operand) be returned, the action of MAXPS can be emulated using a
sequence of instructions, such as a comparison followed by AND, ANDN and OR.
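
One possible reading of this emulation, modelled on a single lane in scalar C, is shown below. It is a sketch and not the sequence the manual has in mind: the helper names are hypothetical, and the choice to treat a NaN in the first operand as "select the first operand" is an assumption made so that a NaN from either operand is returned.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Bit-level views of a float, standing in for a 32-bit SIMD lane. */
static uint32_t bits_of(float f)     { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float    float_of(uint32_t u) { float f;    memcpy(&f, &u, sizeof f); return f; }

/* Returns the larger value, or the NaN operand if exactly one operand is a NaN. */
float max_return_nan(float a, float b)
{
    /* Comparison step: all-ones mask where the first operand should be kept. */
    uint32_t mask = (a > b || isnan(a)) ? 0xFFFFFFFFu : 0u;

    uint32_t keep_a = bits_of(a) & mask;     /* AND  */
    uint32_t keep_b = bits_of(b) & ~mask;    /* ANDN */
    return float_of(keep_a | keep_b);        /* OR   */
}
```

When only b is a NaN, the comparison is false and the mask is zero, so b is selected; when a is a NaN, the mask is all ones and a is selected.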

EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAX_VL-1:256)
of the corresponding ZMM register destination are zeroed.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128)
of the corresponding ZMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The
destination is not distinct from the first source XMM register and the upper bits (MAX_VL-1:128) of the
corresponding ZMM register destination are unmodified.


Operation

MAX(SRC1, SRC2)
{
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
        ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
        ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
        ELSE IF (SRC1 > SRC2) THEN DEST ← SRC1;
        ELSE DEST ← SRC2;
    FI;
}

VMAXPS (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN
                    DEST[i+31:i] ← MAX(SRC1[i+31:i], SRC2[31:0])
                ELSE
                    DEST[i+31:i] ← MAX(SRC1[i+31:i], SRC2[i+31:i])
            FI;
        ELSE
            IF *merging-masking*                    ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0               ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0


VMAXPS (VEX.256 encoded version)

DEST[31:0] ← MAX(SRC1[31:0], SRC2[31:0])
DEST[63:32] ← MAX(SRC1[63:32], SRC2[63:32])
DEST[95:64] ← MAX(SRC1[95:64], SRC2[95:64])
DEST[127:96] ← MAX(SRC1[127:96], SRC2[127:96])
DEST[159:128] ← MAX(SRC1[159:128], SRC2[159:128])
DEST[191:160] ← MAX(SRC1[191:160], SRC2[191:160])
DEST[223:192] ← MAX(SRC1[223:192], SRC2[223:192])
DEST[255:224] ← MAX(SRC1[255:224], SRC2[255:224])
DEST[MAX_VL-1:256] ← 0


VMAXPS (VEX.128 encoded version)

DEST[31:0] ← MAX(SRC1[31:0], SRC2[31:0])
DEST[63:32] ← MAX(SRC1[63:32], SRC2[63:32])
DEST[95:64] ← MAX(SRC1[95:64], SRC2[95:64])
DEST[127:96] ← MAX(SRC1[127:96], SRC2[127:96])
DEST[MAX_VL-1:128] ← 0


MAXPS (128-bit Legacy SSE version)

DEST[31:0] ← MAX(DEST[31:0], SRC[31:0])
DEST[63:32] ← MAX(DEST[63:32], SRC[63:32])
DEST[95:64] ← MAX(DEST[95:64], SRC[95:64])
DEST[127:96] ← MAX(DEST[127:96], SRC[127:96])
DEST[MAX_VL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent 

VMAXPS __m512 _mm512_max_ps(__m512 a, __m512 b); 
VMAXPS __m512 _mm512_mask_max_ps(__m512 s, __mmask16 k, __m512 a, __m512 b); 
VMAXPS __m512 _mm512_maskz_max_ps(__mmask16 k, __m512 a, __m512 b); 
VMAXPS __m512 _mm512_max_round_ps(__m512 a, __m512 b, int); 
VMAXPS __m512 _mm512_mask_max_round_ps(__m512 s, __mmask16 k, __m512 a, __m512 b, int); 
VMAXPS __m512 _mm512_maskz_max_round_ps(__mmask16 k, __m512 a, __m512 b, int); 
VMAXPS __m256 _mm256_mask_max_ps(__m256 s, __mmask8 k, __m256 a, __m256 b); 
VMAXPS __m256 _mm256_maskz_max_ps(__mmask8 k, __m256 a, __m256 b); 
VMAXPS __m128 _mm_mask_max_ps(__m128 s, __mmask8 k, __m128 a, __m128 b); 
VMAXPS __m128 _mm_maskz_max_ps(__mmask8 k, __m128 a, __m128 b); 
VMAXPS __m256 _mm256_max_ps(__m256 a, __m256 b); 
MAXPS __m128 _mm_max_ps(__m128 a, __m128 b); 

SIMD Floating-Point Exceptions 

Invalid (including QNaN Source Operand), Denormal 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 2. 

EVEX-encoded instruction, see Exceptions Type E2. 

MAXSD—Return Maximum Scalar Double-Precision Floating-Point Value 

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description 
F2 0F 5F /r MAXSD xmm1, xmm2/m64 | RM | V/V | SSE2 | Return the maximum scalar double-precision floating-point value between xmm2/m64 and xmm1. 
VEX.NDS.128.F2.0F.WIG 5F /r VMAXSD xmm1, xmm2, xmm3/m64 | RVM | V/V | AVX | Return the maximum scalar double-precision floating-point value between xmm3/m64 and xmm2. 
EVEX.NDS.LIG.F2.0F.W1 5F /r VMAXSD xmm1 {k1}{z}, xmm2, xmm3/m64{sae} | T1S | V/V | AVX512F | Return the maximum scalar double-precision floating-point value between xmm3/m64 and xmm2. 


Instruction Operand Encoding 

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA 
RVM | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA 
T1S | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA 


Description 

Compares the low double-precision floating-point values in the first source operand and the second source 
operand, and returns the maximum value to the low quadword of the destination operand. The second source 
operand can be an XMM register or a 64-bit memory location. The first source and destination operands are XMM 
registers. When the second source operand is a memory operand, only 64 bits are accessed. 

If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If 
a value in the second source operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a 
QNaN version of the SNaN is not returned). 

If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid 
floating-point value, is written to the result. If instead of this behavior, it is required that the NaN of either source 
operand be returned, the action of MAXSD can be emulated using a sequence of instructions, such as, a comparison 
followed by AND, ANDN and OR. 
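
The comparison-plus-AND/ANDN/OR sequence mentioned above can be sketched in portable C with 64-bit masks standing in for the packed compare result. This is only an illustration of the blending idea, not the instruction sequence itself; the names bits_of, double_of, model_maxsd and max_nan_either are hypothetical.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a double as its 64-bit pattern and back. */
static uint64_t bits_of(double x) { uint64_t u; memcpy(&u, &x, sizeof u); return u; }
static double double_of(uint64_t u) { double x; memcpy(&x, &u, sizeof x); return x; }

/* Hypothetical model of MAXSD value semantics (second operand on NaN or two zeros). */
double model_maxsd(double src1, double src2)
{
    if (src1 == 0.0 && src2 == 0.0) return src2;
    if (isnan(src1) || isnan(src2)) return src2;
    return (src1 > src2) ? src1 : src2;
}

/* Return a NaN if either source is a NaN. */
double max_nan_either(double src1, double src2)
{
    /* Comparison: all-ones mask when SRC1 is a NaN, zero otherwise. */
    uint64_t mask = isnan(src1) ? UINT64_MAX : 0;
    uint64_t from_src1 = mask & bits_of(src1);                       /* AND  */
    uint64_t from_max = ~mask & bits_of(model_maxsd(src1, src2));    /* ANDN */
    return double_of(from_src1 | from_max);                          /* OR   */
}
```

Only the first-source case needs the blend: when the second source is the NaN, the MAXSD rule already returns it.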

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAX_VL-1:64) of the 
corresponding destination register remain unchanged. 

VEX.128 and EVEX encoded version: Bits (127:64) of the XMM register destination are copied from corresponding 
bits in the first source operand. Bits (MAX_VL-1:128) of the destination register are zeroed. 

EVEX encoded version: The low quadword element of the destination operand is updated according to the 
writemask. 

Software should ensure VMAXSD is encoded with VEX.L=0. Encoding VMAXSD with VEX.L=1 may encounter 
unpredictable behavior across different processor generations. 

Operation 

MAX(SRC1, SRC2) 
{ 
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2; 
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI; 
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI; 
    ELSE IF (SRC1 > SRC2) THEN DEST ← SRC1; 
    ELSE DEST ← SRC2; 
    FI; 
} 

VMAXSD (EVEX encoded version) 
IF k1[0] or *no writemask* 
    THEN DEST[63:0] ← MAX(SRC1[63:0], SRC2[63:0]) 
    ELSE 
        IF *merging-masking* ; merging-masking 
            THEN *DEST[63:0] remains unchanged* 
            ELSE ; zeroing-masking 
                DEST[63:0] ← 0 
        FI; 
FI; 
DEST[127:64] ← SRC1[127:64] 
DEST[MAX_VL-1:128] ← 0 

VMAXSD (VEX.128 encoded version) 
DEST[63:0] ← MAX(SRC1[63:0], SRC2[63:0]) 
DEST[127:64] ← SRC1[127:64] 
DEST[MAX_VL-1:128] ← 0 

MAXSD (128-bit Legacy SSE version) 
DEST[63:0] ← MAX(DEST[63:0], SRC[63:0]) 
DEST[MAX_VL-1:64] (Unmodified) 

Intel C/C++ Compiler Intrinsic Equivalent 

VMAXSD __m128d _mm_max_round_sd(__m128d a, __m128d b, int); 
VMAXSD __m128d _mm_mask_max_round_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, int); 
VMAXSD __m128d _mm_maskz_max_round_sd(__mmask8 k, __m128d a, __m128d b, int); 
MAXSD __m128d _mm_max_sd(__m128d a, __m128d b) 

SIMD Floating-Point Exceptions 

Invalid (Including QNaN Source Operand), Denormal 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 3. 

EVEX-encoded instruction, see Exceptions Type E3. 

MAXSS—Return Maximum Scalar Single-Precision Floating-Point Value 

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description 
F3 0F 5F /r MAXSS xmm1, xmm2/m32 | RM | V/V | SSE | Return the maximum scalar single-precision floating-point value between xmm2/m32 and xmm1. 
VEX.NDS.128.F3.0F.WIG 5F /r VMAXSS xmm1, xmm2, xmm3/m32 | RVM | V/V | AVX | Return the maximum scalar single-precision floating-point value between xmm3/m32 and xmm2. 
EVEX.NDS.LIG.F3.0F.W0 5F /r VMAXSS xmm1 {k1}{z}, xmm2, xmm3/m32{sae} | T1S | V/V | AVX512F | Return the maximum scalar single-precision floating-point value between xmm3/m32 and xmm2. 


Instruction Operand Encoding 

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA 
RVM | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA 
T1S | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA 


Description 

Compares the low single-precision floating-point values in the first source operand and the second source operand, 
and returns the maximum value to the low doubleword of the destination operand. 

If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If 
a value in the second source operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a 
QNaN version of the SNaN is not returned). 

If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid 
floating-point value, is written to the result. If instead of this behavior, it is required that the NaN from either source 
operand be returned, the action of MAXSS can be emulated using a sequence of instructions, such as, a comparison 
followed by AND, ANDN and OR. 

The second source operand can be an XMM register or a 32-bit memory location. The first source and destination 
operands are XMM registers. 

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAX_VL-1:32) of the 
corresponding destination register remain unchanged. 

VEX.128 and EVEX encoded version: The first source operand is an XMM register encoded by VEX.vvvv. Bits 
(127:32) of the XMM register destination are copied from corresponding bits in the first source operand. Bits 
(MAX_VL-1:128) of the destination register are zeroed. 

EVEX encoded version: The low doubleword element of the destination operand is updated according to the 
writemask. 

Software should ensure VMAXSS is encoded with VEX.L=0. Encoding VMAXSS with VEX.L=1 may encounter 
unpredictable behavior across different processor generations. 




Operation 

MAX(SRC1, SRC2) 
{ 
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2; 
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI; 
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI; 
    ELSE IF (SRC1 > SRC2) THEN DEST ← SRC1; 
    ELSE DEST ← SRC2; 
    FI; 
} 

VMAXSS (EVEX encoded version) 
IF k1[0] or *no writemask* 
    THEN DEST[31:0] ← MAX(SRC1[31:0], SRC2[31:0]) 
    ELSE 
        IF *merging-masking* ; merging-masking 
            THEN *DEST[31:0] remains unchanged* 
            ELSE ; zeroing-masking 
                DEST[31:0] ← 0 
        FI; 
FI; 
DEST[127:32] ← SRC1[127:32] 
DEST[MAX_VL-1:128] ← 0 

VMAXSS (VEX.128 encoded version) 
DEST[31:0] ← MAX(SRC1[31:0], SRC2[31:0]) 
DEST[127:32] ← SRC1[127:32] 
DEST[MAX_VL-1:128] ← 0 

MAXSS (128-bit Legacy SSE version) 
DEST[31:0] ← MAX(DEST[31:0], SRC[31:0]) 
DEST[MAX_VL-1:32] (Unmodified) 

Intel C/C++ Compiler Intrinsic Equivalent 

VMAXSS __m128 _mm_max_round_ss(__m128 a, __m128 b, int); 
VMAXSS __m128 _mm_mask_max_round_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, int); 
VMAXSS __m128 _mm_maskz_max_round_ss(__mmask8 k, __m128 a, __m128 b, int); 
MAXSS __m128 _mm_max_ss(__m128 a, __m128 b) 

SIMD Floating-Point Exceptions 

Invalid (Including QNaN Source Operand), Denormal 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 3. 

EVEX-encoded instruction, see Exceptions Type E3. 

MFENCE—Memory Fence 

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description 
0F AE F0 | MFENCE | NP | Valid | Valid | Serializes load and store operations. 


Instruction Operand Encoding 

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
NP | NA | NA | NA | NA 


Description 

Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior 
to the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes 
the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows 
the MFENCE instruction (see footnote 1). The MFENCE instruction is ordered with respect to all load and store 
instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions 
(such as the CPUID instruction). MFENCE does not serialize the instruction stream. 

Weakly ordered memory types can be used to achieve higher processor performance through such techniques as 
out-of-order issue, speculative reads, write-combining, and write-collapsing. The degree to which a consumer of 
data recognizes or knows that the data is weakly ordered varies among applications and may be unknown to the 
producer of this data. The MFENCE instruction provides a performance-efficient way of ensuring load and store 
ordering between routines that produce weakly-ordered results and routines that consume that data. 
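
A producer/consumer hand-off of this kind can be sketched with the portable C11 fence atomic_thread_fence(memory_order_seq_cst). Compilers targeting x86 commonly lower that fence to MFENCE or to an equivalent locked instruction, but that mapping is an assumption about the toolchain, not something this reference guarantees. The mailbox type and the function names are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical one-slot mailbox: a plain payload guarded by an atomic flag. */
struct mailbox {
    int payload;
    atomic_int ready;
};

/* Producer: write the payload, fence, then raise the flag. */
void publish(struct mailbox *m, int value)
{
    m->payload = value;
    atomic_thread_fence(memory_order_seq_cst);  /* payload store ordered before flag store */
    atomic_store_explicit(&m->ready, 1, memory_order_relaxed);
}

/* Consumer: returns 1 and fills *out once the flag has been observed. */
int try_consume(struct mailbox *m, int *out)
{
    if (!atomic_load_explicit(&m->ready, memory_order_relaxed))
        return 0;
    atomic_thread_fence(memory_order_seq_cst);  /* flag load ordered before payload load */
    *out = m->payload;
    return 1;
}
```

Code that must emit the instruction itself would use the _mm_mfence intrinsic listed in this entry; the C11 fence is shown here only because it compiles on any conforming toolchain.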

Processors are free to fetch and cache data speculatively from regions of system memory that use the WB, WC, and 
WT memory types. This speculative fetching can occur at any time and is not tied to instruction execution. Thus, it 
is not ordered with respect to executions of the MFENCE instruction; data can be brought into the caches 
speculatively just before, during, or after the execution of an MFENCE instruction. 

This instruction's operation is the same in non-64-bit modes and 64-bit mode. 

Specification of the instruction's opcode above indicates a ModR/M byte of F0. For this instruction, the processor 
ignores the r/m field of the ModR/M byte. Thus, MFENCE is encoded by any opcode of the form 0F AE Fx, where x 
is in the range 0-7. 
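
The encoding rule above amounts to a three-byte pattern match in which the low three bits of the ModR/M byte are ignored. A minimal C sketch follows; the function name is hypothetical, and the check assumes the buffer starts at the opcode with no prefixes.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the buffer starts with 0F AE Fx, where x is in the range 0-7. */
int is_mfence_encoding(const uint8_t *code, size_t len)
{
    if (len < 3)
        return 0;
    /* Masking with 0xF8 drops the ignored r/m field (bits 2:0). */
    return code[0] == 0x0F && code[1] == 0xAE && (code[2] & 0xF8) == 0xF0;
}
```

A ModR/M byte of F8 or above falls outside the 0-7 range and is therefore rejected.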

Operation 

Wait_On_Following_Loads_And_Stores_Until(preceding_loads_and_stores_globally_visible); 

Intel C/C++ Compiler Intrinsic Equivalent 

void _mm_mfence(void) 

Exceptions (All Modes of Operation) 

#UD If CPUID.01H:EDX.SSE2[bit 26] = 0. 

If the LOCK prefix is used. 


1. A load instruction is considered to become globally visible when the value to be loaded into its destination register is determined. 

MINPD—Minimum of Packed Double-Precision Floating-Point Values 

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description 
66 0F 5D /r MINPD xmm1, xmm2/m128 | RM | V/V | SSE2 | Return the minimum double-precision floating-point values between xmm1 and xmm2/mem. 
VEX.NDS.128.66.0F.WIG 5D /r VMINPD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Return the minimum double-precision floating-point values between xmm2 and xmm3/mem. 
VEX.NDS.256.66.0F.WIG 5D /r VMINPD ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX | Return the minimum packed double-precision floating-point values between ymm2 and ymm3/mem. 
EVEX.NDS.128.66.0F.W1 5D /r VMINPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Return the minimum packed double-precision floating-point values between xmm2 and xmm3/m128/m64bcst and store result in xmm1 subject to writemask k1. 
EVEX.NDS.256.66.0F.W1 5D /r VMINPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Return the minimum packed double-precision floating-point values between ymm2 and ymm3/m256/m64bcst and store result in ymm1 subject to writemask k1. 
EVEX.NDS.512.66.0F.W1 5D /r VMINPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{sae} | FV | V/V | AVX512F | Return the minimum packed double-precision floating-point values between zmm2 and zmm3/m512/m64bcst and store result in zmm1 subject to writemask k1. 


Instruction Operand Encoding 

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA 
RVM | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA 
FV | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA 


Description 

Performs a SIMD compare of the packed double-precision floating-point values in the first source operand and the 
second source operand and returns the minimum value for each pair of values to the destination operand. 

If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is 
returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that 
is, a QNaN version of the SNaN is not returned). 

If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN 
or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source 
operand (from either the first or second operand) be returned, the action of MINPD can be emulated using a 
sequence of instructions, such as, a comparison followed by AND, ANDN and OR. 

EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second 
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector 
broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally 
updated with writemask k1. 

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM 
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAX_VL-1:256) 
of the corresponding ZMM register destination are zeroed. 

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM 
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128) 
of the corresponding ZMM register destination are zeroed. 

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The 
destination is not distinct from the first source XMM register and the upper bits (MAX_VL-1:128) of the 
corresponding ZMM register destination are unmodified. 



Operation 

MIN(SRC1, SRC2) 
{ 
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2; 
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI; 
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI; 
    ELSE IF (SRC1 < SRC2) THEN DEST ← SRC1; 
    ELSE DEST ← SRC2; 
    FI; 
} 

VMINPD (EVEX encoded version) 
(KL, VL) = (2, 128), (4, 256), (8, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 64 
    IF k1[j] OR *no writemask* 
        THEN 
            IF (EVEX.b = 1) AND (SRC2 *is memory*) 
                THEN 
                    DEST[i+63:i] ← MIN(SRC1[i+63:i], SRC2[63:0]) 
                ELSE 
                    DEST[i+63:i] ← MIN(SRC1[i+63:i], SRC2[i+63:i]) 
            FI; 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+63:i] remains unchanged* 
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking 
            FI 
    FI; 
ENDFOR 
DEST[MAX_VL-1:VL] ← 0 
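
The EVEX.b broadcast path in the loop above can be illustrated with a short C sketch: when the broadcast flag is set, every lane is compared against element 0 of the memory operand. The names model_min_f64 and model_vminpd_bcst and the evex_b parameter are hypothetical, and writemasking is left out to keep the example focused.

```c
#include <stddef.h>

/* Hypothetical scalar model of MIN(SRC1, SRC2); value semantics only. */
double model_min_f64(double src1, double src2)
{
    if (src1 == 0.0 && src2 == 0.0) return src2;    /* two zeros: second operand */
    if (src1 != src1 || src2 != src2) return src2;  /* any NaN: second operand */
    return (src1 < src2) ? src1 : src2;
}

/* kl is the lane count (2, 4 or 8). With evex_b != 0, src2_mem[0] is
 * broadcast to every lane; otherwise lanes are paired element-wise. */
void model_vminpd_bcst(double *dest, const double *src1,
                       const double *src2_mem, size_t kl, int evex_b)
{
    for (size_t j = 0; j < kl; j++) {
        double rhs = evex_b ? src2_mem[0] : src2_mem[j];
        dest[j] = model_min_f64(src1[j], rhs);
    }
}
```

With the broadcast form the memory operand supplies a single 64-bit element, which matches the m64bcst notation in the opcode table.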


VMINPD (VEX.256 encoded version) 
DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0]) 
DEST[127:64] ← MIN(SRC1[127:64], SRC2[127:64]) 
DEST[191:128] ← MIN(SRC1[191:128], SRC2[191:128]) 
DEST[255:192] ← MIN(SRC1[255:192], SRC2[255:192]) 

VMINPD (VEX.128 encoded version) 
DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0]) 
DEST[127:64] ← MIN(SRC1[127:64], SRC2[127:64]) 
DEST[MAX_VL-1:128] ← 0 

MINPD (128-bit Legacy SSE version) 
DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0]) 
DEST[127:64] ← MIN(SRC1[127:64], SRC2[127:64]) 
DEST[MAX_VL-1:128] (Unmodified) 

Intel C/C++ Compiler Intrinsic Equivalent 

VMINPD __m512d _mm512_min_pd(__m512d a, __m512d b); 
VMINPD __m512d _mm512_mask_min_pd(__m512d s, __mmask8 k, __m512d a, __m512d b); 
VMINPD __m512d _mm512_maskz_min_pd(__mmask8 k, __m512d a, __m512d b); 
VMINPD __m512d _mm512_min_round_pd(__m512d a, __m512d b, int); 
VMINPD __m512d _mm512_mask_min_round_pd(__m512d s, __mmask8 k, __m512d a, __m512d b, int); 
VMINPD __m512d _mm512_maskz_min_round_pd(__mmask8 k, __m512d a, __m512d b, int); 
VMINPD __m256d _mm256_mask_min_pd(__m256d s, __mmask8 k, __m256d a, __m256d b); 
VMINPD __m256d _mm256_maskz_min_pd(__mmask8 k, __m256d a, __m256d b); 
VMINPD __m128d _mm_mask_min_pd(__m128d s, __mmask8 k, __m128d a, __m128d b); 
VMINPD __m128d _mm_maskz_min_pd(__mmask8 k, __m128d a, __m128d b); 
VMINPD __m256d _mm256_min_pd(__m256d a, __m256d b); 
MINPD __m128d _mm_min_pd(__m128d a, __m128d b); 

SIMD Floating-Point Exceptions 

Invalid (including QNaN Source Operand), Denormal 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 2. 

EVEX-encoded instruction, see Exceptions Type E2. 

MINPS—Minimum of Packed Single-Precision Floating-Point Values 

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description 
0F 5D /r MINPS xmm1, xmm2/m128 | RM | V/V | SSE | Return the minimum single-precision floating-point values between xmm1 and xmm2/mem. 
VEX.NDS.128.0F.WIG 5D /r VMINPS xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Return the minimum single-precision floating-point values between xmm2 and xmm3/mem. 
VEX.NDS.256.0F.WIG 5D /r VMINPS ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX | Return the minimum packed single-precision floating-point values between ymm2 and ymm3/mem. 
EVEX.NDS.128.0F.W0 5D /r VMINPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Return the minimum packed single-precision floating-point values between xmm2 and xmm3/m128/m32bcst and store result in xmm1 subject to writemask k1. 
EVEX.NDS.256.0F.W0 5D /r VMINPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Return the minimum packed single-precision floating-point values between ymm2 and ymm3/m256/m32bcst and store result in ymm1 subject to writemask k1. 
EVEX.NDS.512.0F.W0 5D /r VMINPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{sae} | FV | V/V | AVX512F | Return the minimum packed single-precision floating-point values between zmm2 and zmm3/m512/m32bcst and store result in zmm1 subject to writemask k1. 


Instruction Operand Encoding 

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA 
RVM | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA 
FV | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA 


Description 

Performs a SIMD compare of the packed single-precision floating-point values in the first source operand and the 
second source operand and returns the minimum value for each pair of values to the destination operand. 

If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is 
returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that 
is, a QNaN version of the SNaN is not returned). 

If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN 
or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source 
operand (from either the first or second operand) be returned, the action of MINPS can be emulated using a 
sequence of instructions, such as, a comparison followed by AND, ANDN and OR. 

EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second 
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector 
broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally 
updated with writemask k1. 

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM 
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAX_VL-1:256) 
of the corresponding ZMM register destination are zeroed. 

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM 
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128) 
of the corresponding ZMM register destination are zeroed. 

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The 
destination is not distinct from the first source XMM register and the upper bits (MAX_VL-1:128) of the 
corresponding ZMM register destination are unmodified. 

Operation 

MIN(SRC1, SRC2) 
{ 
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2; 
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI; 
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI; 
    ELSE IF (SRC1 < SRC2) THEN DEST ← SRC1; 
    ELSE DEST ← SRC2; 
    FI; 
} 

VMINPS (EVEX encoded version) 
(KL, VL) = (4, 128), (8, 256), (16, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 32 
    IF k1[j] OR *no writemask* 
        THEN 
            IF (EVEX.b = 1) AND (SRC2 *is memory*) 
                THEN 
                    DEST[i+31:i] ← MIN(SRC1[i+31:i], SRC2[31:0]) 
                ELSE 
                    DEST[i+31:i] ← MIN(SRC1[i+31:i], SRC2[i+31:i]) 
            FI; 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+31:i] remains unchanged* 
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking 
            FI 
    FI; 
ENDFOR 
DEST[MAX_VL-1:VL] ← 0 

VMINPS (VEX.256 encoded version) 
DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0]) 
DEST[63:32] ← MIN(SRC1[63:32], SRC2[63:32]) 
DEST[95:64] ← MIN(SRC1[95:64], SRC2[95:64]) 
DEST[127:96] ← MIN(SRC1[127:96], SRC2[127:96]) 
DEST[159:128] ← MIN(SRC1[159:128], SRC2[159:128]) 
DEST[191:160] ← MIN(SRC1[191:160], SRC2[191:160]) 
DEST[223:192] ← MIN(SRC1[223:192], SRC2[223:192]) 
DEST[255:224] ← MIN(SRC1[255:224], SRC2[255:224]) 

VMINPS (VEX.128 encoded version) 
DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0]) 
DEST[63:32] ← MIN(SRC1[63:32], SRC2[63:32]) 
DEST[95:64] ← MIN(SRC1[95:64], SRC2[95:64]) 
DEST[127:96] ← MIN(SRC1[127:96], SRC2[127:96]) 
DEST[MAX_VL-1:128] ← 0 


MINPS (128-bit Legacy SSE version) 
DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0]) 
DEST[63:32] ← MIN(SRC1[63:32], SRC2[63:32]) 
DEST[95:64] ← MIN(SRC1[95:64], SRC2[95:64]) 
DEST[127:96] ← MIN(SRC1[127:96], SRC2[127:96]) 
DEST[MAX_VL-1:128] (Unmodified) 

Intel C/C++ Compiler Intrinsic Equivalent 

VMINPS __m512 _mm512_min_ps(__m512 a, __m512 b); 
VMINPS __m512 _mm512_mask_min_ps(__m512 s, __mmask16 k, __m512 a, __m512 b); 
VMINPS __m512 _mm512_maskz_min_ps(__mmask16 k, __m512 a, __m512 b); 
VMINPS __m512 _mm512_min_round_ps(__m512 a, __m512 b, int); 
VMINPS __m512 _mm512_mask_min_round_ps(__m512 s, __mmask16 k, __m512 a, __m512 b, int); 
VMINPS __m512 _mm512_maskz_min_round_ps(__mmask16 k, __m512 a, __m512 b, int); 
VMINPS __m256 _mm256_mask_min_ps(__m256 s, __mmask8 k, __m256 a, __m256 b); 
VMINPS __m256 _mm256_maskz_min_ps(__mmask8 k, __m256 a, __m256 b); 
VMINPS __m128 _mm_mask_min_ps(__m128 s, __mmask8 k, __m128 a, __m128 b); 
VMINPS __m128 _mm_maskz_min_ps(__mmask8 k, __m128 a, __m128 b); 
VMINPS __m256 _mm256_min_ps(__m256 a, __m256 b); 
MINPS __m128 _mm_min_ps(__m128 a, __m128 b); 

SIMD Floating-Point Exceptions 

Invalid (including QNaN Source Operand), Denormal 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 2. 

EVEX-encoded instruction, see Exceptions Type E2. 

MINSD—Return Minimum Scalar Double-Precision Floating-Point Value 

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description 
F2 0F 5D /r MINSD xmm1, xmm2/m64 | RM | V/V | SSE2 | Return the minimum scalar double-precision floating-point value between xmm2/m64 and xmm1. 
VEX.NDS.128.F2.0F.WIG 5D /r VMINSD xmm1, xmm2, xmm3/m64 | RVM | V/V | AVX | Return the minimum scalar double-precision floating-point value between xmm3/m64 and xmm2. 
EVEX.NDS.LIG.F2.0F.W1 5D /r VMINSD xmm1 {k1}{z}, xmm2, xmm3/m64{sae} | T1S | V/V | AVX512F | Return the minimum scalar double-precision floating-point value between xmm3/m64 and xmm2. 


Instruction Operand Encoding 

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA 
RVM | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA 
T1S | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA 


Description 

Compares the low double-precision floating-point values in the first source operand and the second source 
operand, and returns the minimum value to the low quadword of the destination operand. When the second source 
operand is a memory operand, only 64 bits are accessed. 

If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If 
a value in the second source operand is an SNaN, then SNaN is returned unchanged to the destination (that is, a 
QNaN version of the SNaN is not returned). 

If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid 
floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source operand 
(from either the first or second source) be returned, the action of MINSD can be emulated using a sequence of 
instructions, such as, a comparison followed by AND, ANDN and OR. 
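
The rule that the second source operand is written whenever one input is a NaN makes the operation order-sensitive. The C sketch below contrasts a value-only model of that rule with a wrapper that instead prefers the numeric operand, one possible alternative policy that software might build on top. Both function names are hypothetical, and neither models exception signaling.

```c
#include <math.h>

/* Hypothetical model of MINSD value semantics. */
double model_minsd(double src1, double src2)
{
    if (src1 == 0.0 && src2 == 0.0) return src2;    /* two zeros: second operand */
    if (isnan(src1) || isnan(src2)) return src2;    /* any NaN: second operand */
    return (src1 < src2) ? src1 : src2;
}

/* Alternative policy: when exactly one input is a NaN, return the other one. */
double min_prefer_number(double src1, double src2)
{
    if (isnan(src1)) return src2;
    if (isnan(src2)) return src1;
    return model_minsd(src1, src2);
}
```

Under the instruction's rule, model_minsd(NAN, 2.0) gives 2.0 but model_minsd(2.0, NAN) gives a NaN; the wrapper returns 2.0 in both orders.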

The second source operand can be an XMM register or a 64-bit memory location. The first source and destination 
operands are XMM registers. 

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAX_VL-1:64) of the 
corresponding destination register remain unchanged. 

VEX.128 and EVEX encoded version: Bits (127:64) of the XMM register destination are copied from corresponding 
bits in the first source operand. Bits (MAX_VL-1:128) of the destination register are zeroed. 

EVEX encoded version: The low quadword element of the destination operand is updated according to the 
writemask. 

Software should ensure VMINSD is encoded with VEX.L=0. Encoding VMINSD with VEX.L=1 may encounter 
unpredictable behavior across different processor generations. 




Operation 

MIN(SRC1, SRC2) 
{ 
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2; 
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI; 
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI; 
    ELSE IF (SRC1 < SRC2) THEN DEST ← SRC1; 
    ELSE DEST ← SRC2; 
    FI; 
} 

VMINSD (EVEX encoded version) 
IF k1[0] or *no writemask* 
    THEN DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0]) 
    ELSE 
        IF *merging-masking* ; merging-masking 
            THEN *DEST[63:0] remains unchanged* 
            ELSE ; zeroing-masking 
                DEST[63:0] ← 0 
        FI; 
FI; 
DEST[127:64] ← SRC1[127:64] 
DEST[MAX_VL-1:128] ← 0 

VMINSD (VEX.128 encoded version) 
DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0]) 
DEST[127:64] ← SRC1[127:64] 
DEST[MAX_VL-1:128] ← 0 

MINSD (128-bit Legacy SSE version) 
DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0]) 
DEST[MAX_VL-1:64] (Unmodified) 

Intel C/C++ Compiler Intrinsic Equivalent 

VMINSD __m128d _mm_min_round_sd(__m128d a, __m128d b, int); 
VMINSD __m128d _mm_mask_min_round_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, int); 
VMINSD __m128d _mm_maskz_min_round_sd(__mmask8 k, __m128d a, __m128d b, int); 
MINSD __m128d _mm_min_sd(__m128d a, __m128d b) 

SIMD Floating-Point Exceptions 

Invalid (including QNaN Source Operand), Denormal 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 3. 

EVEX-encoded instruction, see Exceptions Type E3. 

MINSS—Return Minimum Scalar Single-Precision Floating-Point Value 

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description 
F3 0F 5D /r MINSS xmm1, xmm2/m32 | RM | V/V | SSE | Return the minimum scalar single-precision floating-point value between xmm2/m32 and xmm1. 
VEX.NDS.128.F3.0F.WIG 5D /r VMINSS xmm1, xmm2, xmm3/m32 | RVM | V/V | AVX | Return the minimum scalar single-precision floating-point value between xmm3/m32 and xmm2. 
EVEX.NDS.LIG.F3.0F.W0 5D /r VMINSS xmm1 {k1}{z}, xmm2, xmm3/m32{sae} | T1S | V/V | AVX512F | Return the minimum scalar single-precision floating-point value between xmm3/m32 and xmm2. 


Instruction Operand Encoding 

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA 
RVM | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA 
T1S | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA 


Description 

Compares the low single-precision floating-point values in the first source operand and the second source operand 
and returns the minimum value to the low doubleword of the destination operand. 

If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If 
a value in the second operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a QNaN 
version of the SNaN is not returned). 

If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid 
floating-point value, is written to the result. If instead of this behavior, it is required that the NaN in either source 
operand be returned, the action of MINSS can be emulated using a sequence of instructions, such as, a comparison 
followed by AND, ANDN and OR. 

The second source operand can be an XMM register or a 32-bit memory location. The first source and destination 
operands are XMM registers. 

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAX_VL-1:32) of the 
corresponding destination register remain unchanged. 

VEX.128 and EVEX encoded version: The first source operand is an XMM register encoded by (E)VEX.vvvv. Bits 
(127:32) of the XMM register destination are copied from corresponding bits in the first source operand. Bits 
(MAX_VL-1:128) of the destination register are zeroed. 

EVEX encoded version: The low doubleword element of the destination operand is updated according to the 
writemask. 

Software should ensure VMINSS is encoded with VEX.L=0. Encoding VMINSS with VEX.L=1 may encounter
unpredictable behavior across different processor generations.

Operation

MIN(SRC1, SRC2)

{

    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;

        ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;

        ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;

        ELSE IF (SRC1 < SRC2) THEN DEST ← SRC1;

        ELSE DEST ← SRC2;

    FI;

}

MINSS (EVEX encoded version) 

IF k1[0] or *no writemask*

    THEN DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0])

    ELSE

        IF *merging-masking* ; merging-masking

            THEN *DEST[31:0] remains unchanged*

            ELSE ; zeroing-masking

                THEN DEST[31:0] ← 0
        FI;

FI;

DEST[127:32] ← SRC1[127:32]

DEST[MAX_VL-1:128] ← 0

VMINSS (VEX.128 encoded version)

DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0])

DEST[127:32] ← SRC1[127:32]

DEST[MAX_VL-1:128] ← 0

MINSS (128-bit Legacy SSE version)

DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0])

DEST[MAX_VL-1:32] (Unmodified)
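
The three encodings differ only in what happens to the bits above the low doubleword. As a rough illustration, the sketch below treats a vector register as a Python integer; MAX_VL = 512, the function name write_scalar_result, and the encoding labels are assumptions made for this example, and EVEX write-masking is deliberately left out.

```python
MAX_VL = 512          # assumed maximum vector length in bits (illustrative)
LOW32 = 0xFFFFFFFF    # mask for the low doubleword

def write_scalar_result(dest, src1, result32, encoding):
    """Return the new destination register value as an integer.

    dest, src1 -- current register contents as unsigned integers
    result32   -- 32-bit pattern produced by MIN(SRC1[31:0], SRC2[31:0])
    encoding   -- "legacy", "vex128" or "evex" (hypothetical labels)
    """
    result32 &= LOW32
    if encoding == "legacy":
        # Legacy SSE: bits MAX_VL-1:32 of the destination are left unchanged.
        all_bits = (1 << MAX_VL) - 1
        return (dest & all_bits & ~LOW32) | result32
    if encoding in ("vex128", "evex"):
        # VEX.128 / EVEX: bits 127:32 come from SRC1, bits above 127 are zeroed.
        bits_127_32 = src1 & ((1 << 128) - 1) & ~LOW32
        return bits_127_32 | result32
    raise ValueError("unknown encoding: " + encoding)
```

The practical consequence is that the legacy form carries a dependency on the old destination contents, whereas the VEX and EVEX forms fully define the destination from their sources.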

Intel C/C++ Compiler Intrinsic Equivalent 

VMINSS __m128 _mm_min_round_ss(__m128 a, __m128 b, int);

VMINSS __m128 _mm_mask_min_round_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, int);

VMINSS __m128 _mm_maskz_min_round_ss(__mmask8 k, __m128 a, __m128 b, int);

MINSS __m128 _mm_min_ss(__m128 a, __m128 b)

SIMD Floating-Point Exceptions 

Invalid (Including QNaN Source Operand), Denormal 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 2. 

EVEX-encoded instruction, see Exceptions Type E2. 



MONITOR—Set Up Monitor Address 


| Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description |
|---|---|---|---|---|---|
| 0F 01 C8 | MONITOR | NP | Valid | Valid | Sets up a linear address range to be monitored by hardware and activates the monitor. The address range should be a write-back memory caching type. The address is DS:EAX (DS:RAX in 64-bit mode). |


Instruction Operand Encoding 


| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| NP | NA | NA | NA | NA |


Description 

The MONITOR instruction arms address monitoring hardware using an address specified in EAX (the address range 
that the monitoring hardware checks for store operations can be determined by using CPUID). A store to an 
address within the specified address range triggers the monitoring hardware. The state of monitor hardware is 
used by MWAIT. 

The content of EAX is an effective address (in 64-bit mode, RAX is used). By default, the DS segment is used to 
create a linear address that is monitored. Segment overrides can be used. 

ECX and EDX are also used. They communicate other information to MONITOR. ECX specifies optional extensions. 
EDX specifies optional hints; it does not change the architectural behavior of the instruction. For the Pentium 4 
processor (family 15, model 3), no extensions or hints are defined. Undefined hints in EDX are ignored by the
processor; undefined extensions in ECX raise a general protection fault.

The address range must use memory of the write-back type. Only write-back memory will correctly trigger the 
monitoring hardware. Additional information on determining what address range to use in order to prevent false 
wake-ups is described in Chapter 8, "Multiple-Processor Management" of the Intel® 64 and IA-32 Architectures
Software Developer's Manual, Volume 3A. 

The MONITOR instruction is ordered as a load operation with respect to other memory transactions. The instruction 
is subject to the permission checking and faults associated with a byte load. Like a load, MONITOR sets the A-bit 
but not the D-bit in page tables. 

CPUID.01H:ECX.MONITOR[bit 3] indicates the availability of MONITOR and MWAIT in the processor. When set,
MONITOR may be executed only at privilege level 0 (use at any other privilege level results in an invalid-opcode
exception). The operating system or system BIOS may disable this instruction by using the IA32_MISC_ENABLE
MSR; disabling MONITOR clears the CPUID feature flag and causes execution to generate an invalid-opcode
exception.
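
The availability and register checks described above can be summarised as a small decision function. This is a minimal sketch under stated assumptions: monitor_fault is a hypothetical name, the relative priority of #UD over #GP(0) is assumed rather than taken from the text, and segment-limit, canonical-address and page faults are not modelled.

```python
def monitor_fault(cpuid_monitor_bit, cpl, ecx):
    """Return the exception MONITOR would raise, or None if it can proceed.

    cpuid_monitor_bit -- value of CPUID.01H:ECX.MONITOR[bit 3] (0 or 1)
    cpl               -- current privilege level (0..3)
    ecx               -- extensions requested in ECX
    """
    if not cpuid_monitor_bit:
        return "#UD"      # feature absent or disabled through IA32_MISC_ENABLE
    if cpl != 0:
        return "#UD"      # only privilege level 0 may execute MONITOR
    if ecx != 0:
        return "#GP(0)"   # no extensions are defined, so ECX must be zero
    return None           # hints in EDX are ignored and never fault
```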

The instruction's operation is the same in non-64-bit modes and 64-bit mode. 

Operation 

MONITOR sets up an address range for the monitor hardware using the content of EAX (RAX in 64-bit mode) as an 
effective address and puts the monitor hardware in armed state. Always use memory of the write-back caching 
type. A store to the specified address range will trigger the monitor hardware. The content of ECX and EDX are 
used to communicate other information to the monitor hardware. 

Intel C/C++ Compiler Intrinsic Equivalent 

MONITOR: void _mm_monitor(void const *p, unsigned extensions,unsigned hints) 

Numeric Exceptions 

None

Protected Mode Exceptions

#GP(0) If the value in EAX is outside the CS, DS, ES, FS, or GS segment limit. 

If the DS, ES, FS, or GS register is used to access memory and it contains a NULL segment 
selector. 

If ECX ≠ 0.

#SS(0) If the value in EAX is outside the SS segment limit. 

#PF(fault-code) For a page fault. 

#UD If CPUID.01H:ECX.MONITOR[bit 3] = 0. 

If current privilege level is not 0. 


Real Address Mode Exceptions 

#GP If the CS, DS, ES, FS, or GS register is used to access memory and the value in EAX is outside of the
effective address space from 0 to FFFFH.

If ECX ≠ 0.

#SS If the SS register is used to access memory and the value in EAX is outside of the effective address
space from 0 to FFFFH.

#UD If CPUID.01H:ECX.MONITOR[bit 3] = 0. 


Virtual 8086 Mode Exceptions 

#UD The MONITOR instruction is not recognized in virtual-8086 mode (even if
CPUID.01H:ECX.MONITOR[bit 3] = 1).

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 


64-Bit Mode Exceptions 

#GP(0) If the linear address of the operand in the CS, DS, ES, FS, or GS segment is in a non-canonical
form.

If RCX ≠ 0.

#SS(0) If the SS register is used to access memory and the value in EAX is in a non-canonical form.

#PF(fault-code) For a page fault.

#UD If the current privilege level is not 0.

If CPUID.01H:ECX.MONITOR[bit 3] = 0.

MOV—Move


| Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description |
|---|---|---|---|---|---|
| 88 /r | MOV r/m8,r8 | MR | Valid | Valid | Move r8 to r/m8. |
| REX + 88 /r | MOV r/m8***,r8*** | MR | Valid | N.E. | Move r8 to r/m8. |
| 89 /r | MOV r/m16,r16 | MR | Valid | Valid | Move r16 to r/m16. |
| 89 /r | MOV r/m32,r32 | MR | Valid | Valid | Move r32 to r/m32. |
| REX.W + 89 /r | MOV r/m64,r64 | MR | Valid | N.E. | Move r64 to r/m64. |
| 8A /r | MOV r8,r/m8 | RM | Valid | Valid | Move r/m8 to r8. |
| REX + 8A /r | MOV r8***,r/m8*** | RM | Valid | N.E. | Move r/m8 to r8. |
| 8B /r | MOV r16,r/m16 | RM | Valid | Valid | Move r/m16 to r16. |
| 8B /r | MOV r32,r/m32 | RM | Valid | Valid | Move r/m32 to r32. |
| REX.W + 8B /r | MOV r64,r/m64 | RM | Valid | N.E. | Move r/m64 to r64. |
| 8C /r | MOV r/m16,Sreg** | MR | Valid | Valid | Move segment register to r/m16. |
| REX.W + 8C /r | MOV r/m64,Sreg** | MR | Valid | Valid | Move zero extended 16-bit segment register to r/m64. |
| 8E /r | MOV Sreg,r/m16** | RM | Valid | Valid | Move r/m16 to segment register. |
| REX.W + 8E /r | MOV Sreg,r/m64** | RM | Valid | Valid | Move lower 16 bits of r/m64 to segment register. |
| A0 | MOV AL,moffs8* | FD | Valid | Valid | Move byte at (seg:offset) to AL. |
| REX.W + A0 | MOV AL,moffs8* | FD | Valid | N.E. | Move byte at (offset) to AL. |
| A1 | MOV AX,moffs16* | FD | Valid | Valid | Move word at (seg:offset) to AX. |
| A1 | MOV EAX,moffs32* | FD | Valid | Valid | Move doubleword at (seg:offset) to EAX. |
| REX.W + A1 | MOV RAX,moffs64* | FD | Valid | N.E. | Move quadword at (offset) to RAX. |
| A2 | MOV moffs8,AL | TD | Valid | Valid | Move AL to (seg:offset). |
| REX.W + A2 | MOV moffs8***,AL | TD | Valid | N.E. | Move AL to (offset). |
| A3 | MOV moffs16*,AX | TD | Valid | Valid | Move AX to (seg:offset). |
| A3 | MOV moffs32*,EAX | TD | Valid | Valid | Move EAX to (seg:offset). |
| REX.W + A3 | MOV moffs64*,RAX | TD | Valid | N.E. | Move RAX to (offset). |
| B0+ rb ib | MOV r8, imm8 | OI | Valid | Valid | Move imm8 to r8. |
| REX + B0+ rb ib | MOV r8***, imm8 | OI | Valid | N.E. | Move imm8 to r8. |
| B8+ rw iw | MOV r16, imm16 | OI | Valid | Valid | Move imm16 to r16. |
| B8+ rd id | MOV r32, imm32 | OI | Valid | Valid | Move imm32 to r32. |
| REX.W + B8+ rd io | MOV r64, imm64 | OI | Valid | N.E. | Move imm64 to r64. |
| C6 /0 ib | MOV r/m8, imm8 | MI | Valid | Valid | Move imm8 to r/m8. |
| REX + C6 /0 ib | MOV r/m8***, imm8 | MI | Valid | N.E. | Move imm8 to r/m8. |
| C7 /0 iw | MOV r/m16, imm16 | MI | Valid | Valid | Move imm16 to r/m16. |
| C7 /0 id | MOV r/m32, imm32 | MI | Valid | Valid | Move imm32 to r/m32. |
| REX.W + C7 /0 id | MOV r/m64, imm32 | MI | Valid | N.E. | Move imm32 sign extended to 64-bits to r/m64. |



NOTES: 

* The moffs8, moffs16, moffs32 and moffs64 operands specify a simple offset relative to the segment base, where 8, 16, 32 and 64
refer to the size of the data. The address-size attribute of the instruction determines the size of the offset, either 16, 32 or 64
bits.

** In 32-bit mode, the assembler may insert the 16-bit operand-size prefix with this instruction (see the following "Description"
section for further information).

*** In 64-bit mode, r/m8 cannot be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.


Instruction Operand Encoding 


| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA |
| RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA |
| FD | AL/AX/EAX/RAX | Moffs | NA | NA |
| TD | Moffs (w) | AL/AX/EAX/RAX | NA | NA |
| OI | opcode + rd (w) | imm8/16/32/64 | NA | NA |
| MI | ModRM:r/m (w) | imm8/16/32/64 | NA | NA |


Description 

Copies the second operand (source operand) to the first operand (destination operand). The source operand can be 
an immediate value, general-purpose register, segment register, or memory location; the destination register can 
be a general-purpose register, segment register, or memory location. Both operands must be the same size, which 
can be a byte, a word, a doubleword, or a quadword. 

The MOV instruction cannot be used to load the CS register. Attempting to do so results in an invalid opcode
exception (#UD). To load the CS register, use the far JMP, CALL, or RET instruction.

If the destination operand is a segment register (DS, ES, FS, GS, or SS), the source operand must be a valid 
segment selector. In protected mode, moving a segment selector into a segment register automatically causes the 
segment descriptor information associated with that segment selector to be loaded into the hidden (shadow) part 
of the segment register. While loading this information, the segment selector and segment descriptor information 
is validated (see the "Operation" algorithm below). The segment descriptor data is obtained from the GDT or LDT 
entry for the specified segment selector. 

A NULL segment selector (values 0000-0003) can be loaded into the DS, ES, FS, and GS registers without causing 
a protection exception. However, any subsequent attempt to reference a segment whose corresponding segment 
register is loaded with a NULL value causes a general protection exception (#GP) and no memory reference occurs. 
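
The phrase "values 0000-0003" follows from the layout of a segment selector. The sketch below is a small illustration under the usual selector layout (bits 1:0 RPL, bit 2 table indicator, bits 15:3 index), which is described in Volume 3A rather than in this section; decode_selector and is_null_selector are hypothetical helper names.

```python
def decode_selector(selector):
    """Split a 16-bit segment selector into its three fields."""
    selector &= 0xFFFF
    return {
        "index": selector >> 3,      # descriptor table index (bits 15:3)
        "ti": (selector >> 2) & 1,   # table indicator: 0 = GDT, 1 = LDT
        "rpl": selector & 3,         # requested privilege level (bits 1:0)
    }

def is_null_selector(selector):
    """A NULL selector has index 0 in the GDT; the RPL bits may be anything."""
    fields = decode_selector(selector)
    return fields["index"] == 0 and fields["ti"] == 0
```

Selectors 0000H through 0003H differ only in their RPL bits, which is why all four count as NULL, while 0004H (index 0 in the LDT) does not.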

Loading the SS register with a MOV instruction inhibits all interrupts until after the execution of the next instruction.
This operation allows a stack pointer to be loaded into the ESP register with the next instruction (MOV ESP,
stack-pointer value) before an interrupt occurs¹. Be aware that the LSS instruction offers a more efficient
method of loading the SS and ESP registers.

When executing MOV Reg, Sreg, the processor copies the content of Sreg to the 16 least significant bits of the
general-purpose register. The upper bits of the destination register are zero for most IA-32 processors (Pentium
Pro processors and later) and all Intel 64 processors, with the exception that bits 31:16 are undefined for Intel
Quark X1000 processors, Pentium and earlier processors.

1. If a code instruction breakpoint (for debug) is placed on an instruction located immediately after a MOV SS instruction, the
breakpoint may not be triggered. However, in a sequence of instructions that load the SS register, only the first instruction in the
sequence is guaranteed to delay an interrupt.

In the following sequence, interrupts may be recognized before MOV ESP, EBP executes:

MOV SS, EDX
MOV SS, EAX
MOV ESP, EBP

In 64-bit mode, the instruction's default operation size is 32 bits. Use of the REX.R prefix permits access to
additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the
beginning of this section for encoding data and limits.
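
To make the effect of the REX prefix concrete, the following sketch assembles the "REX.W + B8+ rd io" form (MOV r64, imm64) by hand. It is an illustrative helper only: encode_mov_r64_imm64 is a hypothetical name and register numbers follow the usual 0-15 hardware numbering (RAX = 0, R10 = 10). Note that for this opcode+rd form the extra register bit is carried in REX.B.

```python
def encode_mov_r64_imm64(reg, imm):
    """Encode MOV r64, imm64 (REX.W + B8+rd io) as a bytes object.

    reg -- hardware register number 0..15 (RAX = 0, ..., R15 = 15)
    imm -- 64-bit immediate; only the low 64 bits are used
    """
    if not 0 <= reg <= 15:
        raise ValueError("register number must be in 0..15")
    rex = 0x48 | (reg >> 3)        # 0100WRXB with W = 1; B holds bit 3 of reg
    opcode = 0xB8 + (reg & 7)      # low three register bits live in the opcode
    immediate = (imm & 0xFFFFFFFFFFFFFFFF).to_bytes(8, "little")
    return bytes([rex, opcode]) + immediate
```

For example, encode_mov_r64_imm64(0, 0x1122334455667788) gives the ten-byte sequence 48 B8 88 77 66 55 44 33 22 11, the longest form of MOV in the table above.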

Operation 

DEST ← SRC;

Loading a segment register while in protected mode results in special checks and actions, as described in the 
following listing. These checks are performed on the segment selector and the segment descriptor to which it 
points. 

IF SS is loaded
    THEN

        IF segment selector is NULL
            THEN #GP(0); FI;

        IF segment selector index is outside descriptor table limits
        or segment selector's RPL ≠ CPL
        or segment is not a writable data segment
        or DPL ≠ CPL

            THEN #GP(selector); FI;

        IF segment not marked present
            THEN #SS(selector);

            ELSE

                SS ← segment selector;

                SS ← segment descriptor; FI;

FI;

IF DS, ES, FS, or GS is loaded with non-NULL selector
    THEN

        IF segment selector index is outside descriptor table limits
        or segment is not a data or readable code segment
        or ((segment is a data or nonconforming code segment)
        and ((RPL > DPL) and (CPL > DPL)))

            THEN #GP(selector); FI;

        IF segment not marked present
            THEN #NP(selector);

            ELSE

                SegmentRegister ← segment selector;

                SegmentRegister ← segment descriptor; FI;

FI;

IF DS, ES, FS, or GS is loaded with NULL selector
    THEN

        SegmentRegister ← segment selector;

        SegmentRegister ← segment descriptor;

FI;
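
The checks in the listing above can be followed step by step in a small model. This is a sketch under loud assumptions: the Descriptor class, the dictionary used as a descriptor table, and the function names are all invented for illustration; the table indicator bit is ignored; and the privilege test mirrors this section's wording ("both the RPL and the CPL are greater than the DPL") rather than the full protection rules of Volume 3A.

```python
class Descriptor:
    """Hypothetical, heavily simplified segment descriptor."""
    def __init__(self, kind, dpl, writable=False, readable=False, present=True):
        self.kind = kind            # "data", "code_nonconforming" or "code_conforming"
        self.dpl = dpl
        self.writable = writable
        self.readable = readable
        self.present = present

def load_ss(selector, table, cpl):
    """Return the fault raised by loading SS, or None on success."""
    if selector & 0xFFFC == 0:
        return "#GP(0)"                       # NULL selector is not allowed in SS
    rpl = selector & 3
    desc = table.get(selector >> 3)           # None models "outside table limits"
    if (desc is None or rpl != cpl
            or not (desc.kind == "data" and desc.writable)
            or desc.dpl != cpl):
        return "#GP(selector)"
    if not desc.present:
        return "#SS(selector)"
    return None

def load_data_segment(selector, table, cpl):
    """Return the fault raised by loading DS, ES, FS or GS, or None on success."""
    if selector & 0xFFFC == 0:
        return None                           # NULL selector loads without a fault
    rpl = selector & 3
    desc = table.get(selector >> 3)
    if desc is None:
        return "#GP(selector)"
    readable_code = desc.kind.startswith("code") and desc.readable
    if not (desc.kind == "data" or readable_code):
        return "#GP(selector)"
    if (desc.kind in ("data", "code_nonconforming")
            and rpl > desc.dpl and cpl > desc.dpl):
        return "#GP(selector)"
    if not desc.present:
        return "#NP(selector)"
    return None
```

Note the asymmetry the model makes visible: a not-present segment yields #SS(selector) for SS but #NP(selector) for the other data-segment registers.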

Flags Affected 

None

Protected Mode Exceptions

#GP(0) If attempt is made to load SS register with NULL segment selector.

If the destination operand is in a non-writable segment.

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register contains a NULL segment selector.

#GP(selector) If segment selector index is outside descriptor table limits.

If the SS register is being loaded and the segment selector's RPL and the segment descriptor's
DPL are not equal to the CPL.

If the SS register is being loaded and the segment pointed to is a
non-writable data segment.

If the DS, ES, FS, or GS register is being loaded and the segment pointed to is not a data or
readable code segment.

If the DS, ES, FS, or GS register is being loaded and the segment pointed to is a data or
nonconforming code segment, but both the RPL and the CPL are greater than the DPL.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#SS(selector) If the SS register is being loaded and the segment pointed to is marked not present.

#NP If the DS, ES, FS, or GS register is being loaded and the segment pointed to is marked not
present.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
current privilege level is 3.

#UD If attempt is made to load the CS register.

If the LOCK prefix is used.


Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If attempt is made to load the CS register.

If the LOCK prefix is used.


Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made.

#UD If attempt is made to load the CS register.

If the LOCK prefix is used.


Compatibility Mode Exceptions 

Same exceptions as in protected mode.

64-Bit Mode Exceptions

#GP(0) If the memory address is in a non-canonical form.

If an attempt is made to load SS register with NULL segment selector when CPL = 3.

If an attempt is made to load SS register with NULL segment selector when CPL < 3 and CPL
≠ RPL.

#GP(selector) If segment selector index is outside descriptor table limits.

If the memory access to the descriptor table is non-canonical.

If the SS register is being loaded and the segment selector's RPL and the segment descriptor's
DPL are not equal to the CPL.

If the SS register is being loaded and the segment pointed to is a nonwritable data segment.

If the DS, ES, FS, or GS register is being loaded and the segment pointed to is not a data or
readable code segment.

If the DS, ES, FS, or GS register is being loaded and the segment pointed to is a data or
nonconforming code segment, but both the RPL and the CPL are greater than the DPL.

#SS(0) If the stack address is in a non-canonical form.

#SS(selector) If the SS register is being loaded and the segment pointed to is marked not present.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
current privilege level is 3.

#UD If attempt is made to load the CS register.

If the LOCK prefix is used.

MOV—Move to/from Control Registers


| Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description |
|---|---|---|---|---|---|
| 0F 20 /r | MOV r32, CR0-CR7 | MR | N.E. | Valid | Move control register to r32. |
| 0F 20 /r | MOV r64, CR0-CR7 | MR | Valid | N.E. | Move extended control register to r64. |
| REX.R + 0F 20 /0 | MOV r64, CR8 | MR | Valid | N.E. | Move extended CR8 to r64.¹ |
| 0F 22 /r | MOV CR0-CR7, r32 | RM | N.E. | Valid | Move r32 to control register. |
| 0F 22 /r | MOV CR0-CR7, r64 | RM | Valid | N.E. | Move r64 to extended control register. |
| REX.R + 0F 22 /0 | MOV CR8, r64 | RM | Valid | N.E. | Move r64 to extended CR8.¹ |


NOTE: 

1. MOV CR* instructions, except for MOV CR8, are serializing instructions. MOV CR8 is not architecturally defined as a
serializing instruction. For more information, see Chapter 8 in Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 3A.


Instruction Operand Encoding

| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA |
| RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA |


Description 

Moves the contents of a control register (CR0, CR2, CR3, CR4, or CR8) to a general-purpose register or the
contents of a general purpose register to a control register. The operand size for these instructions is always 32 bits 
in non-64-bit modes, regardless of the operand-size attribute. (See "Control Registers" in Chapter 2 of the Intel® 
64 and IA-32 Architectures Software Developer's Manual, Volume 3A, for a detailed description of the flags and 
fields in the control registers.) This instruction can be executed only when the current privilege level is 0. 

At the opcode level, the reg field within the ModR/M byte specifies which of the control registers is loaded or read. 
The 2 bits in the mod field are ignored. The r/m field specifies the general-purpose register loaded or read. 
Attempts to reference CR1, CR5, CR6, CR7, and CR9-CR15 result in undefined opcode (#UD) exceptions.
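
The field assignments just described are enough to assemble these instructions by hand. The sketch below does so for 64-bit mode; encode_mov_cr is a hypothetical helper, the mod field is set to 11b purely by convention (the text says it is ignored), and rejected register numbers stand in for the #UD cases.

```python
VALID_CONTROL_REGISTERS = {0, 2, 3, 4, 8}   # per the description above

def encode_mov_cr(cr, gpr, to_cr):
    """Encode MOV CRn, r64 (to_cr=True) or MOV r64, CRn (to_cr=False).

    cr  -- control register number
    gpr -- general-purpose register number 0..15 (RAX = 0, ..., R15 = 15)
    """
    if cr not in VALID_CONTROL_REGISTERS:
        raise ValueError("CR%d is not accessible (#UD on hardware)" % cr)
    if not 0 <= gpr <= 15:
        raise ValueError("register number must be in 0..15")
    rex = 0x40
    if cr == 8:
        rex |= 0x04                 # REX.R selects CR8 through the reg field
    if gpr >= 8:
        rex |= 0x01                 # REX.B selects R8-R15 through the r/m field
    opcode = 0x22 if to_cr else 0x20
    modrm = 0xC0 | ((cr & 7) << 3) | (gpr & 7)   # reg = control reg, r/m = GPR
    prefix = bytes([rex]) if rex != 0x40 else b""
    return prefix + bytes([0x0F, opcode, modrm])
```

For instance, encode_mov_cr(3, 0, True) produces 0F 22 D8 (MOV CR3, RAX) and encode_mov_cr(8, 0, True) produces 44 0F 22 C0 (MOV CR8, RAX).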

When loading control registers, programs should not attempt to change the reserved bits; that is, always set 
reserved bits to the value previously read. An attempt to change CR4's reserved bits will cause a general protection 
fault. Reserved bits in CR0 and CR3 remain clear after any load of those registers; attempts to set them have no
impact. On Pentium 4, Intel Xeon and P6 family processors, CR0.ET remains set after any load of CR0; attempts to
clear this bit have no impact. 

In certain cases, these instructions have the side effect of invalidating entries in the TLBs and the paging-structure 
caches. See Section 4.10.4.1, "Operations that Invalidate TLBs and Paging-Structure Caches," in the Intel® 64 and
IA-32 Architectures Software Developer's Manual, Volume 3A for details. 

The following side effects are implementation-specific for the Pentium 4, Intel Xeon, and P6 processor family: when 
modifying PE or PG in register CR0, or PSE or PAE in register CR4, all TLB entries are flushed, including global
entries. Software should not depend on this functionality in all Intel 64 or IA-32 processors. 

In 64-bit mode, the instruction's default operation size is 64 bits. The REX.R prefix must be used to access CR8. Use
of REX.B permits access to additional registers (R8-R15). Use of the REX.W prefix or 66H prefix is ignored. Use of
the REX.R prefix to specify a register other than CR8 causes an invalid-opcode exception. See the summary chart
at the beginning of this section for encoding data and limits.

If CR4.PCIDE = 1, bit 63 of the source operand to MOV to CR3 determines whether the instruction invalidates 
entries in the TLBs and the paging-structure caches (see Section 4.10.4.1, "Operations that Invalidate TLBs and 
Paging-Structure Caches," in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A). The
instruction does not modify bit 63 of CR3, which is reserved and always 0. 
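
A short model may help show how bit 63 of the source operand is consumed rather than stored. This is a sketch under assumptions: it covers only the CR4.PCIDE = 1 case, the name cr3_write_with_pcid is hypothetical, and the direction of the bit (1 meaning that invalidation is not required) follows the Volume 3A section cited above rather than anything stated on this page.

```python
NO_INVALIDATE_BIT = 1 << 63
MASK_64 = (1 << 64) - 1

def cr3_write_with_pcid(source):
    """Model MOV to CR3 when CR4.PCIDE = 1 (other cases are not modelled).

    Returns a dict with the value that ends up in CR3, the PCID taken from
    bits 11:0, and whether TLB and paging-structure-cache entries for that
    PCID are invalidated.
    """
    source &= MASK_64
    return {
        "cr3": source & ~NO_INVALIDATE_BIT & MASK_64,   # bit 63 always reads as 0
        "pcid": source & 0xFFF,
        "invalidates": (source & NO_INVALIDATE_BIT) == 0,
    }
```

An operating system can use this to switch address spaces without discarding cached translations that are still tagged with the incoming PCID.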

See "Changes to Instruction Behavior in VMX Non-Root Operation" in Chapter 25 of the Intel® 64 and IA-32
Architectures Software Developer's Manual, Volume 3C, for more information about the behavior of this instruction in
VMX non-root operation. 

Operation 

DEST ← SRC;

Flags Affected 

The OF, SF, ZF, AF, PF, and CF flags are undefined. 

Protected Mode Exceptions 

#GP(0) If the current privilege level is not 0. 

If an attempt is made to write invalid bit combinations in CR0 (such as setting the PG flag to 1
when the PE flag is set to 0, or setting the CD flag to 0 when the NW flag is set to 1). 

If an attempt is made to write a 1 to any reserved bit in CR4. 

If an attempt is made to write 1 to CR4.PCIDE. 

If any of the reserved bits are set in the page-directory pointers table (PDPT) and the loading 
of a control register causes the PDPT to be loaded into the processor. 

#UD If the LOCK prefix is used. 

If an attempt is made to access CR1, CR5, CR6, or CR7.

Real-Address Mode Exceptions

#GP If an attempt is made to write a 1 to any reserved bit in CR4.

If an attempt is made to write 1 to CR4.PCIDE.

If an attempt is made to write invalid bit combinations in CR0 (such as setting the PG flag to 1
when the PE flag is set to 0).

#UD If the LOCK prefix is used.

If an attempt is made to access CR1, CR5, CR6, or CR7.

Virtual-8086 Mode Exceptions

#GP(0) These instructions cannot be executed in virtual-8086 mode.

Compatibility Mode Exceptions

#GP(0) If the current privilege level is not 0.

If an attempt is made to write invalid bit combinations in CR0 (such as setting the PG flag to 1
when the PE flag is set to 0, or setting the CD flag to 0 when the NW flag is set to 1).

If an attempt is made to change CR4.PCIDE from 0 to 1 while CR3[11:0] ≠ 000H.

If an attempt is made to clear CR0.PG[bit 31] while CR4.PCIDE = 1.

If an attempt is made to write a 1 to any reserved bit in CR3.

If an attempt is made to leave IA-32e mode by clearing CR4.PAE[bit 5].

#UD If the LOCK prefix is used.

If an attempt is made to access CR1, CR5, CR6, or CR7.

64-Bit Mode Exceptions

#GP(0) If the current privilege level is not 0.

If an attempt is made to write invalid bit combinations in CR0 (such as setting the PG flag to 1
when the PE flag is set to 0, or setting the CD flag to 0 when the NW flag is set to 1).

If an attempt is made to change CR4.PCIDE from 0 to 1 while CR3[11:0] ≠ 000H.

If an attempt is made to clear CR0.PG[bit 31].

If an attempt is made to write a 1 to any reserved bit in CR4.

If an attempt is made to write a 1 to any reserved bit in CR8.

If an attempt is made to write a 1 to any reserved bit in CR3.

If an attempt is made to leave IA-32e mode by clearing CR4.PAE[bit 5].

#UD If the LOCK prefix is used.

If an attempt is made to access CR1, CR5, CR6, or CR7.

If the REX.R prefix is used to specify a register other than CR8.

MOV—Move to/from Debug Registers


| Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description |
|---|---|---|---|---|---|
| 0F 21 /r | MOV r32, DR0-DR7 | MR | N.E. | Valid | Move debug register to r32. |
| 0F 21 /r | MOV r64, DR0-DR7 | MR | Valid | N.E. | Move extended debug register to r64. |
| 0F 23 /r | MOV DR0-DR7, r32 | RM | N.E. | Valid | Move r32 to debug register. |
| 0F 23 /r | MOV DR0-DR7, r64 | RM | Valid | N.E. | Move r64 to extended debug register. |


Instruction Operand Encoding

| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA |
| RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA |


Description 

Moves the contents of a debug register (DR0, DR1, DR2, DR3, DR4, DR5, DR6, or DR7) to a general-purpose
register or vice versa. The operand size for these instructions is always 32 bits in non-64-bit modes, regardless of
the operand-size attribute. (See Section 17.2, "Debug Registers", of the Intel® 64 and IA-32 Architectures
Software Developer's Manual, Volume 3A, for a detailed description of the flags and fields in the debug registers.)

The instructions must be executed at privilege level 0 or in real-address mode. 

When the debug extension (DE) flag in register CR4 is clear, these instructions operate on debug registers in a 
manner that is compatible with Intel386 and Intel486 processors. In this mode, references to DR4 and DR5 refer 
to DR6 and DR7, respectively. When the DE flag in CR4 is set, attempts to reference DR4 and DR5 result in an 
undefined opcode (#UD) exception. (The CR4 register was added to the IA-32 Architecture beginning with the 
Pentium processor.) 
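
The DR4/DR5 aliasing rule is easy to state as a lookup. The following sketch uses a hypothetical function name, resolve_debug_register, and returns the string "#UD" in place of raising the exception.

```python
def resolve_debug_register(number, cr4_de):
    """Return the debug register actually accessed, or "#UD".

    number -- debug register number encoded in the instruction (0..7)
    cr4_de -- value of the DE flag in CR4 (0 or 1)
    """
    if not 0 <= number <= 7:
        raise ValueError("debug register number must be in 0..7")
    if number in (4, 5):
        if cr4_de:
            return "#UD"       # debug extensions enabled: DR4/DR5 are undefined
        return number + 2      # compatibility aliasing: DR4 -> DR6, DR5 -> DR7
    return number
```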

At the opcode level, the reg field within the ModR/M byte specifies which of the debug registers is loaded or read. 
The two bits in the mod field are ignored. The r/m field specifies the general-purpose register loaded or read. 

In 64-bit mode, the instruction's default operation size is 64 bits. Use of the REX.B prefix permits access to
additional registers (R8-R15). Use of the REX.W or 66H prefix is ignored. Use of the REX.R prefix causes an
invalid-opcode exception. See the summary chart at the beginning of this section for encoding data and limits.

Operation 

IF ((DE = 1) and (SRC or DEST = DR4 or DR5))

    THEN

        #UD;

    ELSE

        DEST ← SRC;

FI;

Flags Affected 

The OF, SF, ZF, AF, PF, and CF flags are undefined.

Protected Mode Exceptions

#GP(0) If the current privilege level is not 0. 

#UD If CR4.DE[bit 3] = 1 (debug extensions) and a MOV instruction is executed involving DR4 or DR5.

If the LOCK prefix is used. 

#DB If any debug register is accessed while the DR7.GD[bit 13] = 1. 

Real-Address Mode Exceptions 

#UD If CR4.DE[bit 3] = 1 (debug extensions) and a MOV instruction is executed involving DR4 or DR5.

If the LOCK prefix is used. 

#DB If any debug register is accessed while the DR7.GD[bit 13] = 1. 

Virtual-8086 Mode Exceptions

#GP(0) The debug registers cannot be loaded or read when in virtual-8086 mode. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions 

#GP(0) If the current privilege level is not 0. 

If an attempt is made to write a 1 to any of bits 63:32 in DR6. 

If an attempt is made to write a 1 to any of bits 63:32 in DR7. 

#UD If CR4.DE[bit 3] = 1 (debug extensions) and a MOV instruction is executed involving DR4 or DR5.

If the LOCK prefix is used. 

If the REX.R prefix is used. 

#DB If any debug register is accessed while the DR7.GD[bit 13] = 1. 



MOVAPD—Move Aligned Packed Double-Precision Floating-Point Values 


| Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description |
|---|---|---|---|---|---|
| 66 0F 28 /r | MOVAPD xmm1, xmm2/m128 | RM | V/V | SSE2 | Move aligned packed double-precision floating-point values from xmm2/mem to xmm1. |
| 66 0F 29 /r | MOVAPD xmm2/m128, xmm1 | MR | V/V | SSE2 | Move aligned packed double-precision floating-point values from xmm1 to xmm2/mem. |
| VEX.128.66.0F.WIG 28 /r | VMOVAPD xmm1, xmm2/m128 | RM | V/V | AVX | Move aligned packed double-precision floating-point values from xmm2/mem to xmm1. |
| VEX.128.66.0F.WIG 29 /r | VMOVAPD xmm2/m128, xmm1 | MR | V/V | AVX | Move aligned packed double-precision floating-point values from xmm1 to xmm2/mem. |
| VEX.256.66.0F.WIG 28 /r | VMOVAPD ymm1, ymm2/m256 | RM | V/V | AVX | Move aligned packed double-precision floating-point values from ymm2/mem to ymm1. |
| VEX.256.66.0F.WIG 29 /r | VMOVAPD ymm2/m256, ymm1 | MR | V/V | AVX | Move aligned packed double-precision floating-point values from ymm1 to ymm2/mem. |
| EVEX.128.66.0F.W1 28 /r | VMOVAPD xmm1 {k1}{z}, xmm2/m128 | FVM-RM | V/V | AVX512VL AVX512F | Move aligned packed double-precision floating-point values from xmm2/m128 to xmm1 using writemask k1. |
| EVEX.256.66.0F.W1 28 /r | VMOVAPD ymm1 {k1}{z}, ymm2/m256 | FVM-RM | V/V | AVX512VL AVX512F | Move aligned packed double-precision floating-point values from ymm2/m256 to ymm1 using writemask k1. |
| EVEX.512.66.0F.W1 28 /r | VMOVAPD zmm1 {k1}{z}, zmm2/m512 | FVM-RM | V/V | AVX512F | Move aligned packed double-precision floating-point values from zmm2/m512 to zmm1 using writemask k1. |
| EVEX.128.66.0F.W1 29 /r | VMOVAPD xmm2/m128 {k1}{z}, xmm1 | FVM-MR | V/V | AVX512VL AVX512F | Move aligned packed double-precision floating-point values from xmm1 to xmm2/m128 using writemask k1. |
| EVEX.256.66.0F.W1 29 /r | VMOVAPD ymm2/m256 {k1}{z}, ymm1 | FVM-MR | V/V | AVX512VL AVX512F | Move aligned packed double-precision floating-point values from ymm1 to ymm2/m256 using writemask k1. |
| EVEX.512.66.0F.W1 29 /r | VMOVAPD zmm2/m512 {k1}{z}, zmm1 | FVM-MR | V/V | AVX512F | Move aligned packed double-precision floating-point values from zmm1 to zmm2/m512 using writemask k1. |


Instruction Operand Encoding 


| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA |
| MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA |
| FVM-RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA |
| FVM-MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA |

Description

Moves 2, 4 or 8 double-precision floating-point values from the source operand (second operand) to the destination
operand (first operand). This instruction can be used to load an XMM, YMM or ZMM register from a 128-bit, 256-bit
or 512-bit memory location, to store the contents of an XMM, YMM or ZMM register into a 128-bit, 256-bit or
512-bit memory location, or to move data between two XMM, two YMM or two ZMM registers.

When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte (128-bit 
versions), 32-byte (256-bit version) or 64-byte (EVEX.512 encoded version) boundary or a general-protection 
exception (#GP) will be generated. For EVEX encoded versions, the operand must be aligned to the size of the 
memory operand. To move double-precision floating-point values to and from unaligned memory locations, use the 
VMOVUPD instruction. 
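
The alignment rule above can be checked in software before an aligned move is attempted. The following is a minimal sketch in portable C, not part of the instruction definition; the function names aligned_move_alignment and aligned_move_ok are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: alignment in bytes that an aligned packed move
 * requires for a memory operand of the given width (128, 256 or 512 bits). */
size_t aligned_move_alignment(unsigned operand_bits)
{
    assert(operand_bits == 128 || operand_bits == 256 || operand_bits == 512);
    return operand_bits / 8;   /* 16, 32 or 64 bytes */
}

/* Returns 1 if `p` satisfies that alignment, 0 if an aligned move of this
 * width through `p` would raise #GP. */
int aligned_move_ok(const void *p, unsigned operand_bits)
{
    return ((uintptr_t)p % aligned_move_alignment(operand_bits)) == 0;
}
```

A caller that cannot guarantee the alignment would fall back to the unaligned form (VMOVUPD) rather than risk the exception.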

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD. 

EVEX.512 encoded version:

Moves 512 bits of packed double-precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load a ZMM register from a 512-bit float64
memory location, to store the contents of a ZMM register into a 512-bit float64 memory location, or to move data
between two ZMM registers. When the source or destination operand is a memory operand, the operand must be
aligned on a 64-byte boundary or a general-protection exception (#GP) will be generated. To move double-precision
floating-point values to and from unaligned memory locations, use the VMOVUPD instruction.

VEX.256 and EVEX.256 encoded versions:

Moves 256 bits of packed double-precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory
location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM
registers. When the source or destination operand is a memory operand, the operand must be aligned on a 32-byte
boundary or a general-protection exception (#GP) will be generated. To move double-precision floating-point
values to and from unaligned memory locations, use the VMOVUPD instruction.

128-bit versions:

Moves 128 bits of packed double-precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory
location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two
XMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a
16-byte boundary or a general-protection exception (#GP) will be generated. To move double-precision
floating-point values to and from unaligned memory locations, use the VMOVUPD instruction.

128-bit Legacy SSE version: Bits (MAX_VL-1:128) of the corresponding ZMM destination register remain
unchanged.

(E)VEX.128 encoded version: Bits (MAX_VL-1:128) of the destination ZMM register are zeroed.
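
The difference between the legacy SSE and (E)VEX.128 treatment of the upper register bits can be illustrated with a small software model. This is a sketch only: it assumes MAX_VL = 512, and the type zmm_model and function write_low128 are hypothetical names that do not correspond to any real API.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of a 512-bit vector register as eight 64-bit lanes. */
typedef struct { uint64_t q[8]; } zmm_model;

/* Writes 128 bits (two qwords) into the low lanes of `dest`.
 * vex_encoded != 0 models the (E)VEX.128 forms, which zero bits MAX_VL-1:128;
 * vex_encoded == 0 models the legacy SSE form, which leaves them unchanged. */
void write_low128(zmm_model *dest, const uint64_t src[2], int vex_encoded)
{
    assert(dest != 0 && src != 0);
    dest->q[0] = src[0];
    dest->q[1] = src[1];
    if (vex_encoded) {
        for (int i = 2; i < 8; i++)
            dest->q[i] = 0;        /* upper 384 bits cleared */
    }
}
```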

Operation 

VMOVAPD (EVEX encoded versions, register-copy form)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking*                ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0           ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VMOVAPD (EVEX encoded versions, store-form)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE *DEST[i+63:i] remains unchanged*   ; merging-masking
    FI;
ENDFOR;


VMOVAPD (EVEX encoded versions, load-form)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking*                ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0           ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
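
The per-element loop of the load and register-copy forms can be paraphrased in C. The sketch below is a software model only, under the assumption that each 64-bit element is held in a double; masked_move_f64 and its parameters are hypothetical names. It does not model the final zeroing of bits MAX_VL-1:VL or any alignment fault.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of the masked load/register-copy loop.
 * kl      : number of 64-bit elements (2, 4 or 8)
 * k1      : writemask, bit j controls element j
 * zeroing : nonzero selects zeroing-masking, zero selects merging-masking
 * Pass k1 = 0xFF to model the no-writemask case. */
void masked_move_f64(double *dest, const double *src,
                     unsigned kl, uint8_t k1, int zeroing)
{
    assert(kl == 2 || kl == 4 || kl == 8);
    for (unsigned j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u)
            dest[j] = src[j];        /* element selected by the mask */
        else if (zeroing)
            dest[j] = 0.0;           /* zeroing-masking */
        /* otherwise merging-masking: dest[j] is left unchanged */
    }
}
```

With mask 0101b over four elements, merging-masking keeps the old values of elements 1 and 3, while zeroing-masking clears them. The store form corresponds to always passing zeroing = 0.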

VMOVAPD (VEX.256 encoded version, load- and register-copy)

DEST[255:0] ← SRC[255:0]
DEST[MAX_VL-1:256] ← 0

VMOVAPD (VEX.256 encoded version, store-form)

DEST[255:0] ← SRC[255:0]

VMOVAPD (VEX.128 encoded version, load- and register-copy)

DEST[127:0] ← SRC[127:0]
DEST[MAX_VL-1:128] ← 0

MOVAPD (128-bit load- and register-copy form, Legacy SSE version)

DEST[127:0] ← SRC[127:0]
DEST[MAX_VL-1:128] (Unmodified)

(V)MOVAPD (128-bit store-form version)

DEST[127:0] ← SRC[127:0]

Intel C/C++ Compiler Intrinsic Equivalent

VMOVAPD __m512d _mm512_load_pd( void * m);
VMOVAPD __m512d _mm512_mask_load_pd(__m512d s, __mmask8 k, void * m);
VMOVAPD __m512d _mm512_maskz_load_pd( __mmask8 k, void * m);
VMOVAPD void _mm512_store_pd( void * d, __m512d a);
VMOVAPD void _mm512_mask_store_pd( void * d, __mmask8 k, __m512d a);
VMOVAPD __m256d _mm256_mask_load_pd(__m256d s, __mmask8 k, void * m);
VMOVAPD __m256d _mm256_maskz_load_pd( __mmask8 k, void * m);
VMOVAPD void _mm256_mask_store_pd( void * d, __mmask8 k, __m256d a);
VMOVAPD __m128d _mm_mask_load_pd(__m128d s, __mmask8 k, void * m);
VMOVAPD __m128d _mm_maskz_load_pd( __mmask8 k, void * m);
VMOVAPD void _mm_mask_store_pd( void * d, __mmask8 k, __m128d a);
MOVAPD __m256d _mm256_load_pd (double * p);
MOVAPD void _mm256_store_pd(double * p, __m256d a);
MOVAPD __m128d _mm_load_pd (double * p);
MOVAPD void _mm_store_pd(double * p, __m128d a);

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 1.SSE2;
EVEX-encoded instruction, see Exceptions Type E1.

#UD                 If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.

MOVAPS—Move Aligned Packed Single-Precision Floating-Point Values


Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F 28 /r | MOVAPS xmm1, xmm2/m128 | RM | V/V | SSE | Move aligned packed single-precision floating-point values from xmm2/mem to xmm1.
0F 29 /r | MOVAPS xmm2/m128, xmm1 | MR | V/V | SSE | Move aligned packed single-precision floating-point values from xmm1 to xmm2/mem.
VEX.128.0F.WIG 28 /r | VMOVAPS xmm1, xmm2/m128 | RM | V/V | AVX | Move aligned packed single-precision floating-point values from xmm2/mem to xmm1.
VEX.128.0F.WIG 29 /r | VMOVAPS xmm2/m128, xmm1 | MR | V/V | AVX | Move aligned packed single-precision floating-point values from xmm1 to xmm2/mem.
VEX.256.0F.WIG 28 /r | VMOVAPS ymm1, ymm2/m256 | RM | V/V | AVX | Move aligned packed single-precision floating-point values from ymm2/mem to ymm1.
VEX.256.0F.WIG 29 /r | VMOVAPS ymm2/m256, ymm1 | MR | V/V | AVX | Move aligned packed single-precision floating-point values from ymm1 to ymm2/mem.
EVEX.128.0F.W0 28 /r | VMOVAPS xmm1 {k1}{z}, xmm2/m128 | FVM-RM | V/V | AVX512VL AVX512F | Move aligned packed single-precision floating-point values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.0F.W0 28 /r | VMOVAPS ymm1 {k1}{z}, ymm2/m256 | FVM-RM | V/V | AVX512VL AVX512F | Move aligned packed single-precision floating-point values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.0F.W0 28 /r | VMOVAPS zmm1 {k1}{z}, zmm2/m512 | FVM-RM | V/V | AVX512F | Move aligned packed single-precision floating-point values from zmm2/m512 to zmm1 using writemask k1.
EVEX.128.0F.W0 29 /r | VMOVAPS xmm2/m128 {k1}{z}, xmm1 | FVM-MR | V/V | AVX512VL AVX512F | Move aligned packed single-precision floating-point values from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.0F.W0 29 /r | VMOVAPS ymm2/m256 {k1}{z}, ymm1 | FVM-MR | V/V | AVX512VL AVX512F | Move aligned packed single-precision floating-point values from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.0F.W0 29 /r | VMOVAPS zmm2/m512 {k1}{z}, zmm1 | FVM-MR | V/V | AVX512F | Move aligned packed single-precision floating-point values from zmm1 to zmm2/m512 using writemask k1.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
FVM-RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
FVM-MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA


Description 

Moves 4, 8 or 16 single-precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load an XMM, YMM or ZMM register from a
128-bit, 256-bit or 512-bit memory location, to store the contents of an XMM, YMM or ZMM register into a 128-bit,
256-bit or 512-bit memory location, or to move data between two XMM, two YMM or two ZMM registers.

When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte (128-bit
version), 32-byte (VEX.256 encoded version) or 64-byte (EVEX.512 encoded version) boundary or a
general-protection exception (#GP) will be generated. For EVEX.512 encoded versions, the operand must be aligned
to the size of the memory operand. To move single-precision floating-point values to and from unaligned memory
locations, use the VMOVUPS instruction.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.

EVEX.512 encoded version: 

Moves 512 bits of packed single-precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load a ZMM register from a 512-bit float32
memory location, to store the contents of a ZMM register into a 512-bit float32 memory location, or to move data
between two ZMM registers. When the source or destination operand is a memory operand, the operand must be
aligned on a 64-byte boundary or a general-protection exception (#GP) will be generated. To move single-precision
floating-point values to and from unaligned memory locations, use the VMOVUPS instruction.
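
One way for C code to obtain storage that meets the 64-byte requirement is to request the alignment in the type itself. The sketch below assumes a C11 compiler; the type zmm_f32_block and the function is_64_byte_aligned are hypothetical names introduced for illustration.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Hypothetical storage type for one 512-bit block of sixteen floats.
 * _Alignas(64) asks the compiler for the 64-byte alignment that the
 * EVEX.512 aligned move requires of its memory operand. */
typedef struct {
    _Alignas(64) float lane[16];
} zmm_f32_block;

static_assert(sizeof(zmm_f32_block) == 64, "one block is exactly 512 bits");

/* Returns 1 if `p` sits on a 64-byte boundary, 0 otherwise. */
int is_64_byte_aligned(const void *p)
{
    return ((uintptr_t)p & 63u) == 0;
}
```

Because the alignment is part of the type, every element of an array of zmm_f32_block starts on a 64-byte boundary, while a pointer into the middle of a block does not.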

VEX.256 and EVEX.256 encoded version: 

Moves 256 bits of packed single-precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory
location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM
registers. When the source or destination operand is a memory operand, the operand must be aligned on a 32-byte
boundary or a general-protection exception (#GP) will be generated.

128-bit versions: 

Moves 128 bits of packed single-precision floating-point values from the source operand (second operand) to the 
destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory 
location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two 
XMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a 
16-byte boundary or a general-protection exception (#GP) will be generated. To move single-precision
floating-point values to and from unaligned memory locations, use the VMOVUPS instruction.

128-bit Legacy SSE version: Bits (MAX_VL-1:128) of the corresponding ZMM destination register remain 
unchanged. 

(E)VEX.128 encoded version: Bits (MAX_VL-1:128) of the destination ZMM register are zeroed. 

Operation 

VMOVAPS (EVEX encoded versions, register-copy form)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking*                ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0           ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VMOVAPS (EVEX encoded versions, store-form)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE *DEST[i+31:i] remains unchanged*   ; merging-masking
    FI;
ENDFOR;

VMOVAPS (EVEX encoded versions, load-form)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking*                ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0           ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VMOVAPS (VEX.256 encoded version, load- and register-copy)

DEST[255:0] ← SRC[255:0]
DEST[MAX_VL-1:256] ← 0

VMOVAPS (VEX.256 encoded version, store-form)

DEST[255:0] ← SRC[255:0]

VMOVAPS (VEX.128 encoded version, load- and register-copy)

DEST[127:0] ← SRC[127:0]
DEST[MAX_VL-1:128] ← 0

MOVAPS (128-bit load- and register-copy form, Legacy SSE version)

DEST[127:0] ← SRC[127:0]
DEST[MAX_VL-1:128] (Unmodified)

(V)MOVAPS (128-bit store-form version)

DEST[127:0] ← SRC[127:0]

Intel C/C++ Compiler Intrinsic Equivalent 

VMOVAPS __m512 _mm512_load_ps( void * m);
VMOVAPS __m512 _mm512_mask_load_ps(__m512 s, __mmask16 k, void * m);
VMOVAPS __m512 _mm512_maskz_load_ps( __mmask16 k, void * m);
VMOVAPS void _mm512_store_ps( void * d, __m512 a);
VMOVAPS void _mm512_mask_store_ps( void * d, __mmask16 k, __m512 a);
VMOVAPS __m256 _mm256_mask_load_ps(__m256 a, __mmask8 k, void * s);
VMOVAPS __m256 _mm256_maskz_load_ps( __mmask8 k, void * s);
VMOVAPS void _mm256_mask_store_ps( void * d, __mmask8 k, __m256 a);
VMOVAPS __m128 _mm_mask_load_ps(__m128 a, __mmask8 k, void * s);
VMOVAPS __m128 _mm_maskz_load_ps( __mmask8 k, void * s);
VMOVAPS void _mm_mask_store_ps( void * d, __mmask8 k, __m128 a);
MOVAPS __m256 _mm256_load_ps (float * p);
MOVAPS void _mm256_store_ps(float * p, __m256 a);
MOVAPS __m128 _mm_load_ps (float * p);
MOVAPS void _mm_store_ps(float * p, __m128 a);

SIMD Floating-Point Exceptions 

None 



Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 1.SSE; additionally
#UD                 If VEX.vvvv != 1111B.

EVEX-encoded instruction, see Exceptions Type E1.

MOVBE—Move Data After Swapping Bytes


Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
0F 38 F0 /r | MOVBE r16, m16 | RM | Valid | Valid | Reverse byte order in m16 and move to r16.
0F 38 F0 /r | MOVBE r32, m32 | RM | Valid | Valid | Reverse byte order in m32 and move to r32.
REX.W + 0F 38 F0 /r | MOVBE r64, m64 | RM | Valid | N.E. | Reverse byte order in m64 and move to r64.
0F 38 F1 /r | MOVBE m16, r16 | MR | Valid | Valid | Reverse byte order in r16 and move to m16.
0F 38 F1 /r | MOVBE m32, r32 | MR | Valid | Valid | Reverse byte order in r32 and move to m32.
REX.W + 0F 38 F1 /r | MOVBE m64, r64 | MR | Valid | N.E. | Reverse byte order in r64 and move to m64.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA


Description 

Performs a byte swap operation on the data copied from the second operand (source operand) and stores the result
in the first operand (destination operand). The source operand can be a general-purpose register or a memory
location; the destination operand can be a general-purpose register or a memory location; however, both operands
cannot be registers, and only one operand can be a memory location. Both operands must be the same size, which
can be a word, a doubleword or a quadword.

The MOVBE instruction is provided for swapping the bytes on a read from memory or on a write to memory, thus
providing support for converting little-endian values to big-endian format and vice versa.
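
The byte reversal itself is easy to model in portable C. The sketch below is illustrative only: movbe_model is a hypothetical name, it operates on a value rather than on memory, and it does not model faults or the register/memory operand restriction.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of the MOVBE data transformation: reverse the byte
 * order of the low `operand_size` bits of `src` (16, 32 or 64). */
uint64_t movbe_model(uint64_t src, unsigned operand_size)
{
    assert(operand_size == 16 || operand_size == 32 || operand_size == 64);
    unsigned nbytes = operand_size / 8;
    uint64_t dest = 0;
    for (unsigned b = 0; b < nbytes; b++) {
        uint64_t byte = (src >> (8 * b)) & 0xFFu;   /* byte b of the source */
        dest |= byte << (8 * (nbytes - 1 - b));     /* placed in the mirrored position */
    }
    return dest;
}
```

Applying the model twice with the same operand size returns the original value, which is why one instruction serves both the little-endian to big-endian and the big-endian to little-endian direction.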

In 64-bit mode, the instruction's default operation size is 32 bits. Use of the REX.R prefix permits access to
additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the
beginning of this section for encoding data and limits.


Operation 

TEMP ← SRC

IF (OperandSize = 16)
    THEN
        DEST[7:0] ← TEMP[15:8];
        DEST[15:8] ← TEMP[7:0];
    ELSE IF (OperandSize = 32)
        DEST[7:0] ← TEMP[31:24];
        DEST[15:8] ← TEMP[23:16];
        DEST[23:16] ← TEMP[15:8];
        DEST[31:24] ← TEMP[7:0];
    ELSE IF (OperandSize = 64)
        DEST[7:0] ← TEMP[63:56];
        DEST[15:8] ← TEMP[55:48];
        DEST[23:16] ← TEMP[47:40];
        DEST[31:24] ← TEMP[39:32];
        DEST[39:32] ← TEMP[31:24];
        DEST[47:40] ← TEMP[23:16];
        DEST[55:48] ← TEMP[15:8];
        DEST[63:56] ← TEMP[7:0];
FI;

Flags Affected

None 

Protected Mode Exceptions

#GP(0)              If the destination operand is in a non-writable segment.
                    If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
                    If the DS, ES, FS, or GS register contains a NULL segment selector.
#SS(0)              If a memory operand effective address is outside the SS segment limit.
#PF(fault-code)     If a page fault occurs.
#AC(0)              If alignment checking is enabled and an unaligned memory reference is made while the
                    current privilege level is 3.
#UD                 If CPUID.01H:ECX.MOVBE[bit 22] = 0.
                    If the LOCK prefix is used.
                    If REP (F3H) prefix is used.

Real-Address Mode Exceptions

#GP                 If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS                 If a memory operand effective address is outside the SS segment limit.
#UD                 If CPUID.01H:ECX.MOVBE[bit 22] = 0.
                    If the LOCK prefix is used.
                    If REP (F3H) prefix is used.

Virtual-8086 Mode Exceptions

#GP(0)              If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS(0)              If a memory operand effective address is outside the SS segment limit.
#PF(fault-code)     If a page fault occurs.
#AC(0)              If alignment checking is enabled and an unaligned memory reference is made while the
                    current privilege level is 3.
#UD                 If CPUID.01H:ECX.MOVBE[bit 22] = 0.
                    If the LOCK prefix is used.
                    If REP (F3H) prefix is used.
                    If REPNE (F2H) prefix is used and CPUID.01H:ECX.SSE4_2[bit 20] = 0.

Compatibility Mode Exceptions

Same exceptions as in protected mode.

64-Bit Mode Exceptions

#GP(0)              If the memory address is in a non-canonical form.
#SS(0)              If the stack address is in a non-canonical form.
#PF(fault-code)     If a page fault occurs.
#AC(0)              If alignment checking is enabled and an unaligned memory reference is made while the
                    current privilege level is 3.
#UD                 If CPUID.01H:ECX.MOVBE[bit 22] = 0.
                    If the LOCK prefix is used.
                    If REP (F3H) prefix is used.

MOVD/MOVQ—Move Doubleword/Move Quadword


Opcode | Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
0F 6E /r | MOVD mm, r/m32 | RM | V/V | MMX | Move doubleword from r/m32 to mm.
REX.W + 0F 6E /r | MOVQ mm, r/m64 | RM | V/N.E. | MMX | Move quadword from r/m64 to mm.
0F 7E /r | MOVD r/m32, mm | MR | V/V | MMX | Move doubleword from mm to r/m32.
REX.W + 0F 7E /r | MOVQ r/m64, mm | MR | V/N.E. | MMX | Move quadword from mm to r/m64.
66 0F 6E /r | MOVD xmm, r/m32 | RM | V/V | SSE2 | Move doubleword from r/m32 to xmm.
66 REX.W 0F 6E /r | MOVQ xmm, r/m64 | RM | V/N.E. | SSE2 | Move quadword from r/m64 to xmm.
66 0F 7E /r | MOVD r/m32, xmm | MR | V/V | SSE2 | Move doubleword from xmm register to r/m32.
66 REX.W 0F 7E /r | MOVQ r/m64, xmm | MR | V/N.E. | SSE2 | Move quadword from xmm register to r/m64.
VEX.128.66.0F.W0 6E /r | VMOVD xmm1, r32/m32 | RM | V/V | AVX | Move doubleword from r/m32 to xmm1.
VEX.128.66.0F.W1 6E /r | VMOVQ xmm1, r64/m64 | RM | V/N.E.¹ | AVX | Move quadword from r/m64 to xmm1.
VEX.128.66.0F.W0 7E /r | VMOVD r32/m32, xmm1 | MR | V/V | AVX | Move doubleword from xmm1 register to r/m32.
VEX.128.66.0F.W1 7E /r | VMOVQ r64/m64, xmm1 | MR | V/N.E.¹ | AVX | Move quadword from xmm1 register to r/m64.
EVEX.128.66.0F.W0 6E /r | VMOVD xmm1, r32/m32 | T1S-RM | V/V | AVX512F | Move doubleword from r/m32 to xmm1.
EVEX.128.66.0F.W1 6E /r | VMOVQ xmm1, r64/m64 | T1S-RM | V/N.E.¹ | AVX512F | Move quadword from r/m64 to xmm1.
EVEX.128.66.0F.W0 7E /r | VMOVD r32/m32, xmm1 | T1S-MR | V/V | AVX512F | Move doubleword from xmm1 register to r/m32.
EVEX.128.66.0F.W1 7E /r | VMOVQ r64/m64, xmm1 | T1S-MR | V/N.E.¹ | AVX512F | Move quadword from xmm1 register to r/m64.


NOTES:

1. For this specific instruction, VEX.W/EVEX.W in non-64 bit mode is ignored; the instruction behaves as if the W0
version is used.

Instruction Operand Encoding


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
T1S-RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
T1S-MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA


Description 

Copies a doubleword from the source operand (second operand) to the destination operand (first operand). The 
source and destination operands can be general-purpose registers, MMX technology registers, XMM registers, or 
32-bit memory locations. This instruction can be used to move a doubleword to and from the low doubleword of an 
MMX technology register and a general-purpose register or a 32-bit memory location, or to and from the low 
doubleword of an XMM register and a general-purpose register or a 32-bit memory location. The instruction cannot 
be used to transfer data between MMX technology registers, between XMM registers, between general-purpose 
registers, or between memory locations. 

When the destination operand is an MMX technology register, the source operand is written to the low doubleword 
of the register, and the register is zero-extended to 64 bits. When the destination operand is an XMM register, the 
source operand is written to the low doubleword of the register, and the register is zero-extended to 128 bits. 
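
The zero-extension behavior described above can be paraphrased in C. This is a minimal sketch under stated assumptions: the XMM register is modeled as four doublewords, the MMX register as a 64-bit integer, and the names xmm_model, movd_to_mm, movd_to_xmm and movd_from_xmm are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of a 128-bit XMM register as four doublewords,
 * dword[0] being the low doubleword. */
typedef struct { uint32_t dword[4]; } xmm_model;

/* MOVD with an MMX destination: the doubleword is zero-extended to 64 bits. */
uint64_t movd_to_mm(uint32_t src)
{
    return (uint64_t)src;
}

/* MOVD with an XMM destination: low doubleword written, bits 127:32 zeroed. */
xmm_model movd_to_xmm(uint32_t src)
{
    xmm_model r = { { src, 0, 0, 0 } };
    return r;
}

/* MOVD with an XMM source: only the low doubleword is transferred. */
uint32_t movd_from_xmm(const xmm_model *x)
{
    assert(x != 0);
    return x->dword[0];
}
```

Note that the value is zero-extended, not sign-extended: a source with bit 31 set still produces zeros in every upper bit of the destination.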

In 64-bit mode, the instruction's default operation size is 32 bits. Use of the REX.R prefix permits access to
additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the
beginning of this section for encoding data and limits.

MOVD/Q with XMM destination:

Moves a dword/qword integer from the source operand and stores it in the low 32/64-bits of the destination XMM
register. The upper bits of the destination are zeroed. The source operand can be a 32/64-bit register or 32/64-bit
memory location.

128-bit Legacy SSE version: Bits (MAX_VL-1:128) of the corresponding YMM destination register remain
unchanged. Qword operation requires the use of REX.W=1.

VEX.128 encoded version: Bits (MAX_VL-1:128) of the destination register are zeroed. Qword operation requires
the use of VEX.W=1.

EVEX.128 encoded version: Bits (MAX_VL-1:128) of the destination register are zeroed. Qword operation requires
the use of EVEX.W=1.

MOVD/Q with 32/64 reg/mem destination:

Stores the low dword/qword of the source XMM register to 32/64-bit memory location or general-purpose register.
Qword operation requires the use of REX.W=1, VEX.W=1, or EVEX.W=1.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD. 

If VMOVD or VMOVQ is encoded with VEX.L= 1, an attempt to execute the instruction encoded with VEX.L= 1 will 
cause an #UD exception. 

Operation 

MOVD (when destination operand is MMX technology register)

DEST[31:0] ← SRC;
DEST[63:32] ← 00000000H;

MOVD (when destination operand is XMM register)

DEST[31:0] ← SRC;
DEST[127:32] ← 000000000000000000000000H;
DEST[VLMAX-1:128] (Unmodified)

MOVD (when source operand is MMX technology or XMM register)

DEST ← SRC[31:0];

VMOVD (VEX-encoded version when destination is an XMM register)

DEST[31:0] ← SRC[31:0]
DEST[VLMAX-1:32] ← 0

MOVQ (when destination operand is XMM register)

DEST[63:0] ← SRC[63:0];
DEST[127:64] ← 0000000000000000H;
DEST[VLMAX-1:128] (Unmodified)

MOVQ (when destination operand is r/m64)

DEST[63:0] ← SRC[63:0];

MOVQ (when source operand is XMM register or r/m64)

DEST ← SRC[63:0];

VMOVQ (VEX-encoded version when destination is an XMM register)

DEST[63:0] ← SRC[63:0]
DEST[VLMAX-1:64] ← 0

VMOVD (EVEX-encoded version when destination is an XMM register)

DEST[31:0] ← SRC[31:0]
DEST[511:32] ← 0H

VMOVQ (EVEX-encoded version when destination is an XMM register)

DEST[63:0] ← SRC[63:0]
DEST[511:64] ← 0H

Intel C/C++ Compiler Intrinsic Equivalent

MOVD: __m64 _mm_cvtsi32_si64 (int i)
MOVD: int _mm_cvtsi64_si32 ( __m64 m )
MOVD: __m128i _mm_cvtsi32_si128 (int a)
MOVD: int _mm_cvtsi128_si32 ( __m128i a)
MOVQ: __int64 _mm_cvtsi128_si64(__m128i);
MOVQ: __m128i _mm_cvtsi64_si128(__int64);
VMOVD __m128i _mm_cvtsi32_si128( int);
VMOVD int _mm_cvtsi128_si32( __m128i );
VMOVQ __m128i _mm_cvtsi64_si128 (__int64);
VMOVQ __int64 _mm_cvtsi128_si64(__m128i );
VMOVQ __m128i _mm_loadl_epi64( __m128i * s);
VMOVQ void _mm_storel_epi64( __m128i * d, __m128i s);

Flags Affected 

None 

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 5.
EVEX-encoded instruction, see Exceptions Type E9NF.

#UD                 If VEX.L = 1.
                    If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

MOVDDUP—Replicate Double FP Values


Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F2 0F 12 /r | MOVDDUP xmm1, xmm2/m64 | RM | V/V | SSE3 | Move double-precision floating-point value from xmm2/m64 and duplicate into xmm1.
VEX.128.F2.0F.WIG 12 /r | VMOVDDUP xmm1, xmm2/m64 | RM | V/V | AVX | Move double-precision floating-point value from xmm2/m64 and duplicate into xmm1.
VEX.256.F2.0F.WIG 12 /r | VMOVDDUP ymm1, ymm2/m256 | RM | V/V | AVX | Move even index double-precision floating-point values from ymm2/mem and duplicate each element into ymm1.
EVEX.128.F2.0F.W1 12 /r | VMOVDDUP xmm1 {k1}{z}, xmm2/m64 | DUP-RM | V/V | AVX512VL AVX512F | Move double-precision floating-point value from xmm2/m64 and duplicate each element into xmm1 subject to writemask k1.
EVEX.256.F2.0F.W1 12 /r | VMOVDDUP ymm1 {k1}{z}, ymm2/m256 | DUP-RM | V/V | AVX512VL AVX512F | Move even index double-precision floating-point values from ymm2/m256 and duplicate each element into ymm1 subject to writemask k1.
EVEX.512.F2.0F.W1 12 /r | VMOVDDUP zmm1 {k1}{z}, zmm2/m512 | DUP-RM | V/V | AVX512F | Move even index double-precision floating-point values from zmm2/m512 and duplicate each element into zmm1 subject to writemask k1.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
DUP-RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA


Description 

For 256-bit or higher versions: Duplicates each even-indexed double-precision floating-point value from the source
operand (the second operand) into an adjacent pair and stores the result to the destination operand (the first
operand).

For 128-bit versions: Duplicates the low double-precision floating-point value from the source operand (the second
operand) and stores the result to the destination operand (the first operand).

128-bit Legacy SSE version: Bits (MAX_VL-1:128) of the corresponding destination register are unchanged. The
source operand is an XMM register or a 64-bit memory location.

VEX.128 and EVEX.128 encoded version: Bits (MAX_VL-1:128) of the destination register are zeroed. The source
operand is an XMM register or a 64-bit memory location. The destination is updated conditionally under the
writemask for the EVEX version.

VEX.256 and EVEX.256 encoded version: Bits (MAX_VL-1:256) of the destination register are zeroed. The source
operand is a YMM register or a 256-bit memory location. The destination is updated conditionally under the
writemask for the EVEX version.

EVEX.512 encoded version: The destination is updated according to the writemask. The source operand is a ZMM
register or a 512-bit memory location.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.

Figure 4-2. VMOVDDUP Operation


Operation 

VMOVDDUP (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)

TMP_SRC[63:0] ← SRC[63:0]
TMP_SRC[127:64] ← SRC[63:0]
IF VL >= 256
    TMP_SRC[191:128] ← SRC[191:128]
    TMP_SRC[255:192] ← SRC[191:128]
FI;
IF VL >= 512
    TMP_SRC[319:256] ← SRC[319:256]
    TMP_SRC[383:320] ← SRC[319:256]
    TMP_SRC[447:384] ← SRC[447:384]
    TMP_SRC[511:448] ← SRC[447:384]
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_SRC[i+63:i]
        ELSE
            IF *merging-masking*                ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE                            ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VMOVDDUP (VEX.256 encoded version)

DEST[63:0] ← SRC[63:0]
DEST[127:64] ← SRC[63:0]
DEST[191:128] ← SRC[191:128]
DEST[255:192] ← SRC[191:128]
DEST[MAX_VL-1:256] ← 0

VMOVDDUP (VEX.128 encoded version)

DEST[63:0] ← SRC[63:0]
DEST[127:64] ← SRC[63:0]
DEST[MAX_VL-1:128] ← 0

MOVDDUP (128-bit Legacy SSE version)

DEST[63:0] ← SRC[63:0]
DEST[127:64] ← SRC[63:0]
DEST[MAX_VL-1:128] (Unmodified)
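
The duplication pattern common to all of the forms above can be summarized in a few lines of C. The sketch below is a software model only: movddup_model is a hypothetical name, each 64-bit element is held in a double, and neither writemasking nor the handling of bits above VL is modeled.

```c
#include <assert.h>

/* Illustrative model of the MOVDDUP data movement for kl elements
 * (2 for 128-bit, 4 for 256-bit, 8 for 512-bit operands): every
 * destination element j receives the even-indexed source element at or
 * immediately below j, so elements 2n and 2n+1 both become src[2n]. */
void movddup_model(double *dest, const double *src, unsigned kl)
{
    assert(kl == 2 || kl == 4 || kl == 8);
    for (unsigned j = 0; j < kl; j++)
        dest[j] = src[j & ~1u];   /* clear bit 0 of the index to select the even element */
}
```

For the 128-bit forms only src[0] is read, which matches the 64-bit memory operand of those encodings.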

Intel C/C++ Compiler Intrinsic Equivalent 

VMOVDDUP __m512d _mm512_movedup_pd( __m512d a);
VMOVDDUP __m512d _mm512_mask_movedup_pd(__m512d s, __mmask8 k, __m512d a);
VMOVDDUP __m512d _mm512_maskz_movedup_pd( __mmask8 k, __m512d a);
VMOVDDUP __m256d _mm256_mask_movedup_pd(__m256d s, __mmask8 k, __m256d a);
VMOVDDUP __m256d _mm256_maskz_movedup_pd( __mmask8 k, __m256d a);
VMOVDDUP __m128d _mm_mask_movedup_pd(__m128d s, __mmask8 k, __m128d a);
VMOVDDUP __m128d _mm_maskz_movedup_pd( __mmask8 k, __m128d a);
MOVDDUP __m256d _mm256_movedup_pd (__m256d a);
MOVDDUP __m128d _mm_movedup_pd (__m128d a);

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 5;
EVEX-encoded instruction, see Exceptions Type E5NF.

#UD                 If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.

MOVDQA, VMOVDQA32/64—Move Aligned Packed Integer Values


Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 6F /r | MOVDQA xmm1, xmm2/m128 | RM | V/V | SSE2 | Move aligned packed integer values from xmm2/mem to xmm1.
66 0F 7F /r | MOVDQA xmm2/m128, xmm1 | MR | V/V | SSE2 | Move aligned packed integer values from xmm1 to xmm2/mem.
VEX.128.66.0F.WIG 6F /r | VMOVDQA xmm1, xmm2/m128 | RM | V/V | AVX | Move aligned packed integer values from xmm2/mem to xmm1.
VEX.128.66.0F.WIG 7F /r | VMOVDQA xmm2/m128, xmm1 | MR | V/V | AVX | Move aligned packed integer values from xmm1 to xmm2/mem.
VEX.256.66.0F.WIG 6F /r | VMOVDQA ymm1, ymm2/m256 | RM | V/V | AVX | Move aligned packed integer values from ymm2/mem to ymm1.
VEX.256.66.0F.WIG 7F /r | VMOVDQA ymm2/m256, ymm1 | MR | V/V | AVX | Move aligned packed integer values from ymm1 to ymm2/mem.
EVEX.128.66.0F.W0 6F /r | VMOVDQA32 xmm1 {k1}{z}, xmm2/m128 | FVM-RM | V/V | AVX512VL AVX512F | Move aligned packed doubleword integer values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.66.0F.W0 6F /r | VMOVDQA32 ymm1 {k1}{z}, ymm2/m256 | FVM-RM | V/V | AVX512VL AVX512F | Move aligned packed doubleword integer values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.66.0F.W0 6F /r | VMOVDQA32 zmm1 {k1}{z}, zmm2/m512 | FVM-RM | V/V | AVX512F | Move aligned packed doubleword integer values from zmm2/m512 to zmm1 using writemask k1.
EVEX.128.66.0F.W0 7F /r | VMOVDQA32 xmm2/m128 {k1}{z}, xmm1 | FVM-MR | V/V | AVX512VL AVX512F | Move aligned packed doubleword integer values from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.66.0F.W0 7F /r | VMOVDQA32 ymm2/m256 {k1}{z}, ymm1 | FVM-MR | V/V | AVX512VL AVX512F | Move aligned packed doubleword integer values from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.66.0F.W0 7F /r | VMOVDQA32 zmm2/m512 {k1}{z}, zmm1 | FVM-MR | V/V | AVX512F | Move aligned packed doubleword integer values from zmm1 to zmm2/m512 using writemask k1.
EVEX.128.66.0F.W1 6F /r | VMOVDQA64 xmm1 {k1}{z}, xmm2/m128 | FVM-RM | V/V | AVX512VL AVX512F | Move aligned quadword integer values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.66.0F.W1 6F /r | VMOVDQA64 ymm1 {k1}{z}, ymm2/m256 | FVM-RM | V/V | AVX512VL AVX512F | Move aligned quadword integer values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.66.0F.W1 6F /r | VMOVDQA64 zmm1 {k1}{z}, zmm2/m512 | FVM-RM | V/V | AVX512F | Move aligned packed quadword integer values from zmm2/m512 to zmm1 using writemask k1.
EVEX.128.66.0F.W1 7F /r | VMOVDQA64 xmm2/m128 {k1}{z}, xmm1 | FVM-MR | V/V | AVX512VL AVX512F | Move aligned packed quadword integer values from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.66.0F.W1 7F /r | VMOVDQA64 ymm2/m256 {k1}{z}, ymm1 | FVM-MR | V/V | AVX512VL AVX512F | Move aligned packed quadword integer values from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.66.0F.W1 7F /r | VMOVDQA64 zmm2/m512 {k1}{z}, zmm1 | FVM-MR | V/V | AVX512F | Move aligned packed quadword integer values from zmm1 to zmm2/m512 using writemask k1.




Instruction Operand Encoding 


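Op/En  | Operand 1     | Operand 2     | Operand 3 | Operand 4
--------------------------------------------------------------
RM     | ModRM:reg (w) | ModRM:r/m (r) | NA        | NA
MR     | ModRM:r/m (w) | ModRM:reg (r) | NA        | NA
FVM-RM | ModRM:reg (w) | ModRM:r/m (r) | NA        | NA
FVM-MR | ModRM:r/m (w) | ModRM:reg (r) | NA        | NA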


Description 

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise, the instruction will #UD.

EVEX encoded versions: 

Moves 128, 256 or 512 bits of packed doubleword/quadword integer values from the source operand (the second 
operand) to the destination operand (the first operand). This instruction can be used to load a vector register from 
an int32/int64 memory location, to store the contents of a vector register into an int32/int64 memory location, or 
to move data between two ZMM registers. When the source or destination operand is a memory operand, the 
operand must be aligned on a 16 (EVEX.128)/32(EVEX.256)/64(EVEX.512)-byte boundary or a general-protection 
exception (#GP) will be generated. To move integer data to and from unaligned memory locations, use the 
VMOVDQU instruction. 

The destination operand is updated at 32-bit (VMOVDQA32) or 64-bit (VMOVDQA64) granularity according to the 
writemask. 
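
The alignment rule above can be sketched as a small scalar model. This is only an illustration in Python, not processor behavior: the names GeneralProtectionFault, required_alignment, and check_aligned_access are hypothetical, and the exception class merely stands in for #GP.

```python
class GeneralProtectionFault(Exception):
    """Hypothetical stand-in for the #GP exception (not a real CPU API)."""


def required_alignment(vector_bits: int) -> int:
    """Return the byte alignment an aligned move of this width requires.

    Per the description: 16 bytes for 128-bit, 32 for 256-bit, 64 for 512-bit.
    """
    if vector_bits not in (128, 256, 512):
        raise ValueError("vector length must be 128, 256, or 512 bits")
    return vector_bits // 8


def check_aligned_access(address: int, vector_bits: int) -> None:
    """Raise GeneralProtectionFault if the memory operand is misaligned."""
    alignment = required_alignment(vector_bits)
    if address % alignment != 0:
        raise GeneralProtectionFault(
            f"address {address:#x} is not {alignment}-byte aligned"
        )
```

In this model, check_aligned_access(0x1040, 512) passes silently, while check_aligned_access(0x1010, 256) raises, which corresponds to the case where VMOVDQU would be the appropriate instruction.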

VEX.256 encoded version: 

Moves 256 bits of packed integer values from the source operand (second operand) to the destination operand
(first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the
contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers.

When the source or destination operand is a memory operand, the operand must be aligned on a 32-byte boundary 
or a general-protection exception (#GP) will be generated. To move integer data to and from unaligned memory 
locations, use the VMOVDQU instruction. Bits (MAX_VL-1:256) of the destination register are zeroed. 

128-bit versions: 

Moves 128 bits of packed integer values from the source operand (second operand) to the destination operand 
(first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the 
contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers. 

When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte boundary 
or a general-protection exception (#GP) will be generated. To move integer data to and from unaligned memory 
locations, use the VMOVDQU instruction. 

128-bit Legacy SSE version: Bits (MAX_VL-1:128) of the corresponding ZMM destination register remain 
unchanged. 

VEX.128 encoded version: Bits (MAX_VL-1:128) of the destination register are zeroed.

Operation

VMOVDQA32 (EVEX encoded versions, register-copy form)

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking*                    ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0               ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VMOVDQA32 (EVEX encoded versions, store-form)

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE *DEST[i+31:i] remains unchanged*       ; merging-masking
    FI;
ENDFOR;

VMOVDQA32 (EVEX encoded versions, load-form)

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking*                    ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0               ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
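
The register-copy and load forms above share one per-element rule, which can be modeled in a few lines. The sketch below is a scalar illustration only: masked_move is a hypothetical helper, vectors are plain Python lists of KL element values, and a mask of None stands for the *no writemask* case.

```python
def masked_move(dest, src, mask, zeroing=False):
    """Model the per-element loop of the register-copy/load pseudocode.

    dest, src: lists of KL element values (element j covers bits i+31:i).
    mask:      int whose bit j plays the role of k1[j], or None for no writemask.
    zeroing:   True for zeroing-masking ({z}), False for merging-masking.
    """
    result = list(dest)
    for j in range(len(src)):
        if mask is None or (mask >> j) & 1:
            result[j] = src[j]        # DEST[i+31:i] <- SRC[i+31:i]
        elif zeroing:
            result[j] = 0             # zeroing-masking
        # merging-masking: result[j] keeps the old destination value
    return result


old = [10, 20, 30, 40]
new = [1, 2, 3, 4]
merged = masked_move(old, new, 0b0101)                # elements 0 and 2 selected
zeroed = masked_move(old, new, 0b0101, zeroing=True)
```

Here merged is [1, 20, 3, 40] and zeroed is [1, 0, 3, 0]; the only difference between the two masking modes is what happens to elements whose mask bit is clear.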




VMOVDQA64 (EVEX encoded versions, register-copy form)

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking*                    ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0               ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VMOVDQA64 (EVEX encoded versions, store-form)

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE *DEST[i+63:i] remains unchanged*       ; merging-masking
    FI;
ENDFOR;

VMOVDQA64 (EVEX encoded versions, load-form)

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking*                    ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0               ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
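
The store form differs from the load form in that there is no zeroing branch: memory elements whose mask bit is clear are simply left alone. A minimal sketch of that behavior follows, assuming memory is modeled as a Python bytearray and elements are stored little-endian; masked_store and its parameters are hypothetical names chosen for this illustration.

```python
def masked_store(memory: bytearray, offset: int, elements, mask,
                 elem_bytes: int = 8) -> None:
    """Model the store-form loop: write only the elements selected by mask.

    memory:     bytearray standing in for the destination memory operand.
    offset:     byte offset of element 0 within memory.
    elements:   list of KL unsigned integers from the source register.
    mask:       int whose bit j plays the role of k1[j], or None for no writemask.
    elem_bytes: 8 models VMOVDQA64, 4 models VMOVDQA32.
    """
    for j, value in enumerate(elements):
        if mask is None or (mask >> j) & 1:
            start = offset + j * elem_bytes
            memory[start:start + elem_bytes] = value.to_bytes(elem_bytes, "little")
        # else: the memory element remains unchanged (merging-masking)


buffer = bytearray(b"\xff" * 16)
masked_store(buffer, 0, [0x11, 0x22], 0b10)   # only element 1 is written
```

After the call, the first eight bytes of buffer still hold 0xFF and the second eight bytes hold the little-endian encoding of 0x22.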

VMOVDQA (VEX.256 encoded version, load- and register-copy form)

DEST[255:0] ← SRC[255:0]
DEST[MAX_VL-1:256] ← 0

VMOVDQA (VEX.256 encoded version, store-form)

DEST[255:0] ← SRC[255:0]

VMOVDQA (VEX.128 encoded version)

DEST[127:0] ← SRC[127:0]
DEST[MAX_VL-1:128] ← 0

MOVDQA (128-bit load- and register-copy- form Legacy SSE version)

DEST[127:0] ← SRC[127:0]
DEST[MAX_VL-1:128] (Unmodified)

(V)MOVDQA (128-bit store-form version)

DEST[127:0] ← SRC[127:0]
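
The difference between the VEX-encoded and Legacy SSE register forms lies only in the upper bits of the destination. The sketch below models a vector register as a Python integer and assumes MAX_VL = 512 for illustration; write_vector is a hypothetical helper, not an instruction or intrinsic.

```python
MAX_VL = 512  # assumed maximum vector length for this model


def write_vector(dest: int, src: int, width: int, vex_encoded: bool) -> int:
    """Return the new destination register value after a register-form move.

    width:       128 or 256, the number of bits actually moved.
    vex_encoded: True  -> bits MAX_VL-1:width are zeroed (VEX forms);
                 False -> bits MAX_VL-1:width are left unmodified (Legacy SSE).
    """
    low_mask = (1 << width) - 1
    full_mask = (1 << MAX_VL) - 1
    new_low = src & low_mask
    if vex_encoded:
        return new_low
    return (dest & full_mask & ~low_mask) | new_low


old_dest = (0xABCD << 128) | 0x1111
after_vex = write_vector(old_dest, 0x2222, 128, vex_encoded=True)
after_sse = write_vector(old_dest, 0x2222, 128, vex_encoded=False)
```

In this model after_vex is 0x2222 with all upper bits cleared, while after_sse still carries 0xABCD in bits above 127.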

Intel C/C++ Compiler Intrinsic Equivalent 

VMOVDQA32 __m512i _mm512_load_epi32( void * sa);
VMOVDQA32 __m512i _mm512_mask_load_epi32(__m512i s, __mmask16 k, void * sa);
VMOVDQA32 __m512i _mm512_maskz_load_epi32( __mmask16 k, void * sa);
VMOVDQA32 void _mm512_store_epi32(void * d, __m512i a);
VMOVDQA32 void _mm512_mask_store_epi32(void * d, __mmask16 k, __m512i a);
VMOVDQA32 __m256i _mm256_mask_load_epi32(__m256i s, __mmask8 k, void * sa);
VMOVDQA32 __m256i _mm256_maskz_load_epi32( __mmask8 k, void * sa);
VMOVDQA32 void _mm256_store_epi32(void * d, __m256i a);
VMOVDQA32 void _mm256_mask_store_epi32(void * d, __mmask8 k, __m256i a);
VMOVDQA32 __m128i _mm_mask_load_epi32(__m128i s, __mmask8 k, void * sa);
VMOVDQA32 __m128i _mm_maskz_load_epi32( __mmask8 k, void * sa);
VMOVDQA32 void _mm_store_epi32(void * d, __m128i a);
VMOVDQA32 void _mm_mask_store_epi32(void * d, __mmask8 k, __m128i a);
VMOVDQA64 __m512i _mm512_load_epi64( void * sa);
VMOVDQA64 __m512i _mm512_mask_load_epi64(__m512i s, __mmask8 k, void * sa);
VMOVDQA64 __m512i _mm512_maskz_load_epi64( __mmask8 k, void * sa);
VMOVDQA64 void _mm512_store_epi64(void * d, __m512i a);
VMOVDQA64 void _mm512_mask_store_epi64(void * d, __mmask8 k, __m512i a);
VMOVDQA64 __m256i _mm256_mask_load_epi64(__m256i s, __mmask8 k, void * sa);
VMOVDQA64 __m256i _mm256_maskz_load_epi64( __mmask8 k, void * sa);
VMOVDQA64 void _mm256_store_epi64(void * d, __m256i a);
VMOVDQA64 void _mm256_mask_store_epi64(void * d, __mmask8 k, __m256i a);
VMOVDQA64 __m128i _mm_mask_load_epi64(__m128i s, __mmask8 k, void * sa);
VMOVDQA64 __m128i _mm_maskz_load_epi64( __mmask8 k, void * sa);
VMOVDQA64 void _mm_store_epi64(void * d, __m128i a);
VMOVDQA64 void _mm_mask_store_epi64(void * d, __mmask8 k, __m128i a);
MOVDQA __m256i _mm256_load_si256 (__m256i * p);
MOVDQA void _mm256_store_si256(__m256i *p, __m256i a);
MOVDQA __m128i _mm_load_si128 (__m128i * p);
MOVDQA void _mm_store_si128(__m128i *p, __m128i a);

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type1.SSE2;

EVEX-encoded instruction, see Exceptions Type E1.

#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.


MOVDQU,VMOVDQU8/16/32/64-Move Unaligned Packed Integer Values


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
------------------------------------------------------------------------------------------
F3 0F 6F /r MOVDQU xmm1, xmm2/m128 | RM | V/V | SSE2 | Move unaligned packed integer values from xmm2/m128 to xmm1.
F3 0F 7F /r MOVDQU xmm2/m128, xmm1 | MR | V/V | SSE2 | Move unaligned packed integer values from xmm1 to xmm2/m128.
VEX.128.F3.0F.WIG 6F /r VMOVDQU xmm1, xmm2/m128 | RM | V/V | AVX | Move unaligned packed integer values from xmm2/m128 to xmm1.
VEX.128.F3.0F.WIG 7F /r VMOVDQU xmm2/m128, xmm1 | MR | V/V | AVX | Move unaligned packed integer values from xmm1 to xmm2/m128.
VEX.256.F3.0F.WIG 6F /r VMOVDQU ymm1, ymm2/m256 | RM | V/V | AVX | Move unaligned packed integer values from ymm2/m256 to ymm1.
VEX.256.F3.0F.WIG 7F /r VMOVDQU ymm2/m256, ymm1 | MR | V/V | AVX | Move unaligned packed integer values from ymm1 to ymm2/m256.
EVEX.128.F2.0F.W0 6F /r VMOVDQU8 xmm1 {k1}{z}, xmm2/m128 | FVM-RM | V/V | AVX512VL AVX512BW | Move unaligned packed byte integer values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.F2.0F.W0 6F /r VMOVDQU8 ymm1 {k1}{z}, ymm2/m256 | FVM-RM | V/V | AVX512VL AVX512BW | Move unaligned packed byte integer values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.F2.0F.W0 6F /r VMOVDQU8 zmm1 {k1}{z}, zmm2/m512 | FVM-RM | V/V | AVX512BW | Move unaligned packed byte integer values from zmm2/m512 to zmm1 using writemask k1.
EVEX.128.F2.0F.W0 7F /r VMOVDQU8 xmm2/m128 {k1}{z}, xmm1 | FVM-MR | V/V | AVX512VL AVX512BW | Move unaligned packed byte integer values from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.F2.0F.W0 7F /r VMOVDQU8 ymm2/m256 {k1}{z}, ymm1 | FVM-MR | V/V | AVX512VL AVX512BW | Move unaligned packed byte integer values from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.F2.0F.W0 7F /r VMOVDQU8 zmm2/m512 {k1}{z}, zmm1 | FVM-MR | V/V | AVX512BW | Move unaligned packed byte integer values from zmm1 to zmm2/m512 using writemask k1.
EVEX.128.F2.0F.W1 6F /r VMOVDQU16 xmm1 {k1}{z}, xmm2/m128 | FVM-RM | V/V | AVX512VL AVX512BW | Move unaligned packed word integer values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.F2.0F.W1 6F /r VMOVDQU16 ymm1 {k1}{z}, ymm2/m256 | FVM-RM | V/V | AVX512VL AVX512BW | Move unaligned packed word integer values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.F2.0F.W1 6F /r VMOVDQU16 zmm1 {k1}{z}, zmm2/m512 | FVM-RM | V/V | AVX512BW | Move unaligned packed word integer values from zmm2/m512 to zmm1 using writemask k1.
EVEX.128.F2.0F.W1 7F /r VMOVDQU16 xmm2/m128 {k1}{z}, xmm1 | FVM-MR | V/V | AVX512VL AVX512BW | Move unaligned packed word integer values from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.F2.0F.W1 7F /r VMOVDQU16 ymm2/m256 {k1}{z}, ymm1 | FVM-MR | V/V | AVX512VL AVX512BW | Move unaligned packed word integer values from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.F2.0F.W1 7F /r VMOVDQU16 zmm2/m512 {k1}{z}, zmm1 | FVM-MR | V/V | AVX512BW | Move unaligned packed word integer values from zmm1 to zmm2/m512 using writemask k1.
EVEX.128.F3.0F.W0 6F /r VMOVDQU32 xmm1 {k1}{z}, xmm2/m128 | FVM-RM | V/V | AVX512VL AVX512F | Move unaligned packed doubleword integer values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.F3.0F.W0 6F /r VMOVDQU32 ymm1 {k1}{z}, ymm2/m256 | FVM-RM | V/V | AVX512VL AVX512F | Move unaligned packed doubleword integer values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.F3.0F.W0 6F /r VMOVDQU32 zmm1 {k1}{z}, zmm2/m512 | FVM-RM | V/V | AVX512F | Move unaligned packed doubleword integer values from zmm2/m512 to zmm1 using writemask k1.
EVEX.128.F3.0F.W0 7F /r VMOVDQU32 xmm2/m128 {k1}{z}, xmm1 | FVM-MR | V/V | AVX512VL AVX512F | Move unaligned packed doubleword integer values from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.F3.0F.W0 7F /r VMOVDQU32 ymm2/m256 {k1}{z}, ymm1 | FVM-MR | V/V | AVX512VL AVX512F | Move unaligned packed doubleword integer values from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.F3.0F.W0 7F /r VMOVDQU32 zmm2/m512 {k1}{z}, zmm1 | FVM-MR | V/V | AVX512F | Move unaligned packed doubleword integer values from zmm1 to zmm2/m512 using writemask k1.
EVEX.128.F3.0F.W1 6F /r VMOVDQU64 xmm1 {k1}{z}, xmm2/m128 | FVM-RM | V/V | AVX512VL AVX512F | Move unaligned packed quadword integer values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.F3.0F.W1 6F /r VMOVDQU64 ymm1 {k1}{z}, ymm2/m256 | FVM-RM | V/V | AVX512VL AVX512F | Move unaligned packed quadword integer values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.F3.0F.W1 6F /r VMOVDQU64 zmm1 {k1}{z}, zmm2/m512 | FVM-RM | V/V | AVX512F | Move unaligned packed quadword integer values from zmm2/m512 to zmm1 using writemask k1.
EVEX.128.F3.0F.W1 7F /r VMOVDQU64 xmm2/m128 {k1}{z}, xmm1 | FVM-MR | V/V | AVX512VL AVX512F | Move unaligned packed quadword integer values from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.F3.0F.W1 7F /r VMOVDQU64 ymm2/m256 {k1}{z}, ymm1 | FVM-MR | V/V | AVX512VL AVX512F | Move unaligned packed quadword integer values from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.F3.0F.W1 7F /r VMOVDQU64 zmm2/m512 {k1}{z}, zmm1 | FVM-MR | V/V | AVX512F | Move unaligned packed quadword integer values from zmm1 to zmm2/m512 using writemask k1.
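
The EVEX rows of the table follow a regular pattern: the mandatory prefix (F2 or F3) together with EVEX.W selects the element width, and the vector length decides whether AVX512VL is needed in addition to the base feature. The lookup below is a reading aid built from the table rows only; EVEX_MOVDQU_VARIANTS and describe_evex_movdqu are hypothetical names, and this is not a decoder.

```python
# (mandatory prefix, EVEX.W) -> (mnemonic, element width in bits, base CPUID flag)
EVEX_MOVDQU_VARIANTS = {
    ("F2", 0): ("VMOVDQU8", 8, "AVX512BW"),
    ("F2", 1): ("VMOVDQU16", 16, "AVX512BW"),
    ("F3", 0): ("VMOVDQU32", 32, "AVX512F"),
    ("F3", 1): ("VMOVDQU64", 64, "AVX512F"),
}


def describe_evex_movdqu(prefix: str, w: int, vector_bits: int):
    """Return (mnemonic, element_bits, required CPUID flags) for an EVEX row."""
    if vector_bits not in (128, 256, 512):
        raise ValueError("vector length must be 128, 256, or 512 bits")
    mnemonic, elem_bits, base_flag = EVEX_MOVDQU_VARIANTS[(prefix, w)]
    flags = [base_flag]
    if vector_bits in (128, 256):
        # The 128-bit and 256-bit rows list AVX512VL alongside the base flag.
        flags.insert(0, "AVX512VL")
    return mnemonic, elem_bits, flags
```

For example, describe_evex_movdqu("F2", 1, 256) yields the VMOVDQU16 entry with both AVX512VL and AVX512BW, matching the EVEX.256.F2.0F.W1 rows.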


Instruction Operand Encoding 


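Op/En  | Operand 1     | Operand 2     | Operand 3 | Operand 4
--------------------------------------------------------------
RM     | ModRM:reg (w) | ModRM:r/m (r) | NA        | NA
MR     | ModRM:r/m (w) | ModRM:reg (r) | NA        | NA
FVM-RM | ModRM:reg (w) | ModRM:r/m (r) | NA        | NA
FVM-MR | ModRM:r/m (w) | ModRM:reg (r) | NA        | NA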


Description 

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise, the instruction will #UD.

EVEX encoded versions: 

Moves 128, 256 or 512 bits of packed byte/word/doubleword/quadword integer values from the source operand 
(the second operand) to the destination operand (first operand). This instruction can be used to load a vector 
register from a memory location, to store the contents of a vector register into a memory location, or to move data 
between two vector registers.

The destination operand is updated at 8-bit (VMOVDQU8), 16-bit (VMOVDQU16), 32-bit (VMOVDQU32), or 64-bit
(VMOVDQU64) granularity according to the writemask.

VEX.256 encoded version: 

Moves 256 bits of packed integer values from the source operand (second operand) to the destination operand
(first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the
contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers.

Bits (MAX_VL-1:256) of the destination register are zeroed. 


128-bit versions:

Moves 128 bits of packed integer values from the source operand (second operand) to the destination operand 
(first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the 
contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers. 

128-bit Legacy SSE version: Bits (MAX_VL-1:128) of the corresponding destination register remain unchanged. 

When the source or destination operand is a memory operand, the operand may be unaligned to any alignment
without causing a general-protection exception (#GP) to be generated.

VEX.128 encoded version: Bits (MAX_VL-1:128) of the destination register are zeroed. 

Operation 

VMOVDQU8 (EVEX encoded versions, register-copy form)

(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask*
        THEN DEST[i+7:i] ← SRC[i+7:i]
        ELSE
            IF *merging-masking*                    ; merging-masking
                THEN *DEST[i+7:i] remains unchanged*
                ELSE DEST[i+7:i] ← 0                ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VMOVDQU8 (EVEX encoded versions, store-form)

(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask*
        THEN DEST[i+7:i] ← SRC[i+7:i]
        ELSE *DEST[i+7:i] remains unchanged*        ; merging-masking
    FI;
ENDFOR;

VMOVDQU8 (EVEX encoded versions, load-form)

(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask*
        THEN DEST[i+7:i] ← SRC[i+7:i]
        ELSE
            IF *merging-masking*                    ; merging-masking
                THEN *DEST[i+7:i] remains unchanged*
                ELSE DEST[i+7:i] ← 0                ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VMOVDQU16 (EVEX encoded versions, register-copy form)

(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← SRC[i+15:i]
        ELSE
            IF *merging-masking*                    ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE DEST[i+15:i] ← 0               ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VMOVDQU16 (EVEX encoded versions, store-form)

(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← SRC[i+15:i]
        ELSE *DEST[i+15:i] remains unchanged*       ; merging-masking
    FI;
ENDFOR;

VMOVDQU16 (EVEX encoded versions, load-form)

(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← SRC[i+15:i]
        ELSE
            IF *merging-masking*                    ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE DEST[i+15:i] ← 0               ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VMOVDQU32 (EVEX encoded versions, register-copy form)

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking*                    ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0               ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VMOVDQU32 (EVEX encoded versions, store-form)

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE *DEST[i+31:i] remains unchanged*       ; merging-masking
    FI;
ENDFOR;

VMOVDQU32 (EVEX encoded versions, load-form)

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking*                    ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0               ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VMOVDQU64 (EVEX encoded versions, register-copy form)

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking*                    ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0               ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VMOVDQU64 (EVEX encoded versions, store-form)

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE *DEST[i+63:i] remains unchanged*       ; merging-masking
    FI;
ENDFOR;

VMOVDQU64 (EVEX encoded versions, load-form)

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking*                    ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0               ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
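
Across all four element widths, the (KL, VL) pairs in the pseudocode follow KL = VL / element width. The sketch below computes that value and, as an observation drawn from the intrinsic prototypes in this section rather than a stated rule, picks the smallest mask type with at least KL bits (never narrower than 8). Both function names are hypothetical.

```python
def mask_length(vector_bits: int, elem_bits: int) -> int:
    """Return KL, the number of writemask bits consulted by the loop."""
    if vector_bits not in (128, 256, 512):
        raise ValueError("vector length must be 128, 256, or 512 bits")
    if elem_bits not in (8, 16, 32, 64):
        raise ValueError("element width must be 8, 16, 32, or 64 bits")
    return vector_bits // elem_bits


def mask_type(vector_bits: int, elem_bits: int) -> str:
    """Return the mask type name that can hold KL bits (assumed minimum 8)."""
    return f"__mmask{max(8, mask_length(vector_bits, elem_bits))}"
```

For instance, mask_length(512, 8) is 64 and mask_type(128, 32) is "__mmask8", even though only four of the eight mask bits are consulted for a 128-bit doubleword move.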

VMOVDQU (VEX.256 encoded version, load- and register-copy form)

DEST[255:0] ← SRC[255:0]
DEST[MAX_VL-1:256] ← 0

VMOVDQU (VEX.256 encoded version, store-form)

DEST[255:0] ← SRC[255:0]

VMOVDQU (VEX.128 encoded version)

DEST[127:0] ← SRC[127:0]
DEST[MAX_VL-1:128] ← 0

MOVDQU (128-bit load- and register-copy- form Legacy SSE version)

DEST[127:0] ← SRC[127:0]
DEST[MAX_VL-1:128] (Unmodified)

(V)MOVDQU (128-bit store-form version)

DEST[127:0] ← SRC[127:0]

Intel C/C++ Compiler Intrinsic Equivalent 

VMOVDQU16 __m512i _mm512_mask_loadu_epi16(__m512i s, __mmask32 k, void * sa);
VMOVDQU16 __m512i _mm512_maskz_loadu_epi16( __mmask32 k, void * sa);
VMOVDQU16 void _mm512_mask_storeu_epi16(void * d, __mmask32 k, __m512i a);
VMOVDQU16 __m256i _mm256_mask_loadu_epi16(__m256i s, __mmask16 k, void * sa);
VMOVDQU16 __m256i _mm256_maskz_loadu_epi16( __mmask16 k, void * sa);
VMOVDQU16 void _mm256_mask_storeu_epi16(void * d, __mmask16 k, __m256i a);
VMOVDQU16 void _mm256_maskz_storeu_epi16(void * d, __mmask16 k);
VMOVDQU16 __m128i _mm_mask_loadu_epi16(__m128i s, __mmask8 k, void * sa);
VMOVDQU16 __m128i _mm_maskz_loadu_epi16( __mmask8 k, void * sa);
VMOVDQU16 void _mm_mask_storeu_epi16(void * d, __mmask8 k, __m128i a);
VMOVDQU32 __m512i _mm512_loadu_epi32( void * sa);
VMOVDQU32 __m512i _mm512_mask_loadu_epi32(__m512i s, __mmask16 k, void * sa);
VMOVDQU32 __m512i _mm512_maskz_loadu_epi32( __mmask16 k, void * sa);
VMOVDQU32 void _mm512_storeu_epi32(void * d, __m512i a);
VMOVDQU32 void _mm512_mask_storeu_epi32(void * d, __mmask16 k, __m512i a);
VMOVDQU32 __m256i _mm256_mask_loadu_epi32(__m256i s, __mmask8 k, void * sa);
VMOVDQU32 __m256i _mm256_maskz_loadu_epi32( __mmask8 k, void * sa);
VMOVDQU32 void _mm256_storeu_epi32(void * d, __m256i a);
VMOVDQU32 void _mm256_mask_storeu_epi32(void * d, __mmask8 k, __m256i a);
VMOVDQU32 __m128i _mm_mask_loadu_epi32(__m128i s, __mmask8 k, void * sa);
VMOVDQU32 __m128i _mm_maskz_loadu_epi32( __mmask8 k, void * sa);
VMOVDQU32 void _mm_storeu_epi32(void * d, __m128i a);
VMOVDQU32 void _mm_mask_storeu_epi32(void * d, __mmask8 k, __m128i a);
VMOVDQU64 __m512i _mm512_loadu_epi64( void * sa);
VMOVDQU64 __m512i _mm512_mask_loadu_epi64(__m512i s, __mmask8 k, void * sa);
VMOVDQU64 __m512i _mm512_maskz_loadu_epi64( __mmask8 k, void * sa);
VMOVDQU64 void _mm512_storeu_epi64(void * d, __m512i a);
VMOVDQU64 void _mm512_mask_storeu_epi64(void * d, __mmask8 k, __m512i a);
VMOVDQU64 __m256i _mm256_mask_loadu_epi64(__m256i s, __mmask8 k, void * sa);
VMOVDQU64 __m256i _mm256_maskz_loadu_epi64( __mmask8 k, void * sa);
VMOVDQU64 void _mm256_storeu_epi64(void * d, __m256i a);
VMOVDQU64 void _mm256_mask_storeu_epi64(void * d, __mmask8 k, __m256i a);
VMOVDQU64 __m128i _mm_mask_loadu_epi64(__m128i s, __mmask8 k, void * sa);
VMOVDQU64 __m128i _mm_maskz_loadu_epi64( __mmask8 k, void * sa);
VMOVDQU64 void _mm_storeu_epi64(void * d, __m128i a);
VMOVDQU64 void _mm_mask_storeu_epi64(void * d, __mmask8 k, __m128i a);
VMOVDQU8 __m512i _mm512_mask_loadu_epi8(__m512i s, __mmask64 k, void * sa);
VMOVDQU8 __m512i _mm512_maskz_loadu_epi8( __mmask64 k, void * sa);
VMOVDQU8 void _mm512_mask_storeu_epi8(void * d, __mmask64 k, __m512i a);
VMOVDQU8 __m256i _mm256_mask_loadu_epi8(__m256i s, __mmask32 k, void * sa);
VMOVDQU8 __m256i _mm256_maskz_loadu_epi8( __mmask32 k, void * sa);
VMOVDQU8 void _mm256_mask_storeu_epi8(void * d, __mmask32 k, __m256i a);
VMOVDQU8 void _mm256_maskz_storeu_epi8(void * d, __mmask32 k);
VMOVDQU8 __m128i _mm_mask_loadu_epi8(__m128i s, __mmask16 k, void * sa);
VMOVDQU8 __m128i _mm_maskz_loadu_epi8( __mmask16 k, void * sa);
VMOVDQU8 void _mm_mask_storeu_epi8(void * d, __mmask16 k, __m128i a);
MOVDQU __m256i _mm256_loadu_si256 (__m256i * p);
MOVDQU void _mm256_storeu_si256(__m256i *p, __m256i a);
MOVDQU __m128i _mm_loadu_si128 (__m128i * p);
MOVDQU void _mm_storeu_si128(__m128i *p, __m128i a);

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4; 

EVEX-encoded instruction, see Exceptions Type E4.nb. 

#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.


MOVDQ2Q—Move Quadword from XMM to MMX Technology Register


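Opcode      | Instruction     | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
-------------------------------------------------------------------------------------
F2 0F D6 /r | MOVDQ2Q mm, xmm | RM    | Valid       | Valid           | Move low quadword from xmm to mmx register.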


Instruction Operand Encoding 


Op/En | Operand 1     | Operand 2     | Operand 3 | Operand 4
-------------------------------------------------------------
RM    | ModRM:reg (w) | ModRM:r/m (r) | NA        | NA


Description 

Moves the low quadword from the source operand (second operand) to the destination operand (first operand). 
The source operand is an XMM register and the destination operand is an MMX technology register. 

This instruction causes a transition from x87 FPU to MMX technology operation (that is, the x87 FPU top-of-stack
pointer is set to 0 and the x87 FPU tag word is set to all 0s [valid]). If this instruction is executed while an x87 FPU
floating-point exception is pending, the exception is handled before the MOVDQ2Q instruction is executed.

In 64-bit mode, use of the REX.R prefix permits this instruction to access additional registers (XMM8-XMM15). 

Operation 

DEST ← SRC[63:0];
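
As a rough illustration of the operation and of the x87-to-MMX transition described above, the following sketch models the XMM register as a Python integer and the x87 state as a dictionary. The function name and the dictionary keys top_of_stack and tag_word are hypothetical; they only mirror the two side effects named in the description.

```python
def movdq2q(xmm: int, x87_state: dict) -> int:
    """Return the 64-bit MMX result of moving the low quadword of xmm.

    Also applies the documented transition to MMX technology operation:
    the top-of-stack pointer is set to 0 and the tag word to all 0s (valid).
    """
    x87_state["top_of_stack"] = 0
    x87_state["tag_word"] = 0x0000
    return xmm & 0xFFFF_FFFF_FFFF_FFFF   # DEST <- SRC[63:0]


state = {"top_of_stack": 5, "tag_word": 0xFFFF}
mm = movdq2q((0xDEAD << 64) | 0xBEEF, state)
```

In this model mm holds 0xBEEF and the upper quadword of the source is discarded.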

Intel C/C++ Compiler Intrinsic Equivalent 

MOVDQ2Q: __m64 _mm_movepi64_pi64 ( __m128i a)

SIMD Floating-Point Exceptions 

None. 

Protected Mode Exceptions 

#NM If CR0.TS[bit 3] = 1. 

#UD If CR0.EM[bit 2] = 1. 

If CR4.OSFXSR[bit 9] = 0.

If CPUID.01H:EDX.SSE2[bit 26] = 0. 

If the LOCK prefix is used. 

#MF If there is a pending x87 FPU exception. 

Real-Address Mode Exceptions 

Same exceptions as in protected mode. 

Virtual-8086 Mode Exceptions 

Same exceptions as in protected mode. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions 

Same exceptions as in protected mode.


MOVHLPS—Move Packed Single-Precision Floating-Point Values High to Low


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
------------------------------------------------------------------------------------------
0F 12 /r MOVHLPS xmm1, xmm2 | RM | V/V | SSE | Move two packed single-precision floating-point values from high quadword of xmm2 to low quadword of xmm1.
VEX.NDS.128.0F.WIG 12 /r VMOVHLPS xmm1, xmm2, xmm3 | RVM | V/V | AVX | Merge two packed single-precision floating-point values from high quadword of xmm3 and low quadword of xmm2.
EVEX.NDS.128.0F.W0 12 /r VMOVHLPS xmm1, xmm2, xmm3 | RVM | V/V | AVX512F | Merge two packed single-precision floating-point values from high quadword of xmm3 and low quadword of xmm2.


Instruction Operand Encoding¹


Op/En | Operand 1     | Operand 2     | Operand 3     | Operand 4
-----------------------------------------------------------------
RM    | ModRM:reg (w) | ModRM:r/m (r) | NA            | NA
RVM   | ModRM:reg (w) | vvvv (r)      | ModRM:r/m (r) | NA


Description 

This instruction cannot be used for memory to register moves. 

128-bit two-argument form: 

Moves two packed single-precision floating-point values from the high quadword of the second XMM argument 
(second operand) to the low quadword of the first XMM register (first argument). The quadword at bits 127:64 of 
the destination operand is left unchanged. Bits (MAX_VL-1:128) of the corresponding destination register remain 
unchanged. 

128-bit and EVEX three-argument form:

Moves two packed single-precision floating-point values from the high quadword of the third XMM argument (third 
operand) to the low quadword of the destination (first operand). Copies the high quadword from the second XMM 
argument (second operand) to the high quadword of the destination (first operand). Bits (MAX_VL-1:128) of the 
corresponding destination register are zeroed. 

An attempt to execute VMOVHLPS encoded with VEX.L = 1 or EVEX.L'L = 1 causes a #UD exception.

Operation 

MOVHLPS (128-bit two-argument form)

DEST[63:0] ← SRC[127:64]
DEST[MAX_VL-1:64] (Unmodified)

VMOVHLPS (128-bit three-argument form - VEX & EVEX)

DEST[63:0] ← SRC2[127:64]
DEST[127:64] ← SRC1[127:64]
DEST[MAX_VL-1:128] ← 0
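
The two forms can be compared with a small model in which a 128-bit register is a tuple of four single-precision values, element 0 occupying bits 31:0. This is a sketch under that assumed representation; movhlps and vmovhlps are hypothetical helper names, and bits above 127 are not modeled.

```python
def movhlps(dest, src):
    """Two-argument form: low quadword of dest <- high quadword of src.

    The high quadword of dest (elements 2 and 3) is left unchanged.
    """
    return (src[2], src[3], dest[2], dest[3])


def vmovhlps(src1, src2):
    """Three-argument form: low quadword from the high quadword of src2,
    high quadword copied from the high quadword of src1."""
    return (src2[2], src2[3], src1[2], src1[3])


a = (1.0, 2.0, 3.0, 4.0)
b = (5.0, 6.0, 7.0, 8.0)
two_operand = movhlps(a, b)
three_operand = vmovhlps(a, b)
```

Both calls return (7.0, 8.0, 3.0, 4.0) here; the forms differ in that the three-argument version writes a separate destination and zeroes its bits above 127, whereas the two-argument version overwrites its first operand and preserves those upper bits.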

Intel C/C++ Compiler Intrinsic Equivalent

MOVHLPS __m128 _mm_movehl_ps(__m128 a, __m128 b)

SIMD Floating-Point Exceptions

None 


1. ModRM.MOD = 011B required

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 7; additionally
#UD If VEX.L = 1.

EVEX-encoded instruction, see Exceptions Type E7NM.128.


MOVHPD—Move High Packed Double-Precision Floating-Point Value


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
------------------------------------------------------------------------------------------
66 0F 16 /r MOVHPD xmm1, m64 | RM | V/V | SSE2 | Move double-precision floating-point value from m64 to high quadword of xmm1.
VEX.NDS.128.66.0F.WIG 16 /r VMOVHPD xmm2, xmm1, m64 | RVM | V/V | AVX | Merge double-precision floating-point value from m64 and the low quadword of xmm1.
EVEX.NDS.128.66.0F.W1 16 /r VMOVHPD xmm2, xmm1, m64 | T1S | V/V | AVX512F | Merge double-precision floating-point value from m64 and the low quadword of xmm1.
66 0F 17 /r MOVHPD m64, xmm1 | MR | V/V | SSE2 | Move double-precision floating-point value from high quadword of xmm1 to m64.
VEX.128.66.0F.WIG 17 /r VMOVHPD m64, xmm1 | MR | V/V | AVX | Move double-precision floating-point value from high quadword of xmm1 to m64.
EVEX.128.66.0F.W1 17 /r VMOVHPD m64, xmm1 | T1S-MR | V/V | AVX512F | Move double-precision floating-point value from high quadword of xmm1 to m64.


Instruction Operand Encoding 


Op/En  | Operand 1        | Operand 2     | Operand 3     | Operand 4
---------------------------------------------------------------------
RM     | ModRM:reg (r, w) | ModRM:r/m (r) | NA            | NA
RVM    | ModRM:reg (w)    | VEX.vvvv      | ModRM:r/m (r) | NA
MR     | ModRM:r/m (w)    | ModRM:reg (r) | NA            | NA
T1S    | ModRM:reg (w)    | EVEX.vvvv     | ModRM:r/m (r) | NA
T1S-MR | ModRM:r/m (w)    | ModRM:reg (r) | NA            | NA


Description 

This instruction cannot be used for register to register or memory to memory moves. 

128-bit Legacy SSE load: 

Moves a double-precision floating-point value from the source 64-bit memory operand and stores it in the high
64 bits of the destination XMM register. The lower 64 bits of the XMM register are preserved. Bits (MAX_VL-1:128)
of the corresponding destination register are preserved.

VEX.128 & EVEX encoded load: 

Loads a double-precision floating-point value from the source 64-bit memory operand (the third operand) and
stores it in the upper 64 bits of the destination XMM register (first operand). The low 64 bits from the first source
operand (second operand) are copied to the low 64 bits of the destination. Bits (MAX_VL-1:128) of the
corresponding destination register are zeroed.

128-bit store:

Stores a double-precision floating-point value from the high 64-bits of the XMM register source (second operand) 
to the 64-bit memory location (first operand). 

Note: VMOVHPD (store) (VEX. 128.66.OF 17 /r) is legal and has the same behavior as the existing 66 OF 17 store. 
For VMOVHPD (store) VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instruction will #UD. 

If VMOVHPD is encoded with VEX.L or EVEX.L'L= 1, an attempt to execute the instruction encoded with VEX.L or 
EVEX.L'L= 1 will cause an #UD exception. 
Operation

MOVHPD (128-bit Legacy SSE load)
DEST[63:0] (Unmodified)
DEST[127:64] ← SRC[63:0]
DEST[MAX_VL-1:128] (Unmodified)

VMOVHPD (VEX.128 & EVEX encoded load)
DEST[63:0] ← SRC1[63:0]
DEST[127:64] ← SRC2[63:0]
DEST[MAX_VL-1:128] ← 0

VMOVHPD (store)
DEST[63:0] ← SRC[127:64]
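
The three MOVHPD forms above can be sketched as a scalar C model. This is only an illustration of the pseudocode, not how a compiler or the hardware implements it: the `xmm_model` type and the function names are hypothetical, only bits 127:0 are modeled, and a 64-bit IEEE 754 `double` is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of bits 127:0 of an XMM register as two 64-bit lanes. */
typedef struct { uint64_t lo, hi; } xmm_model;

/* Raw bit pattern of a double (assumes a 64-bit IEEE 754 double). */
uint64_t double_bits(double d) {
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

/* MOVHPD xmm1, m64 (legacy SSE load): only DEST[127:64] changes. */
void movhpd_load(xmm_model *dest, const double *m64) {
    assert(dest != NULL && m64 != NULL);
    dest->hi = double_bits(*m64);
}

/* VMOVHPD xmm2, xmm1, m64: DEST[63:0] <- SRC1[63:0], DEST[127:64] <- m64. */
xmm_model vmovhpd_load(xmm_model src1, const double *m64) {
    assert(m64 != NULL);
    xmm_model dest = { src1.lo, double_bits(*m64) };
    return dest;
}

/* (V)MOVHPD m64, xmm1 (store): m64 <- SRC[127:64]. */
void movhpd_store(double *m64, xmm_model src) {
    assert(m64 != NULL);
    memcpy(m64, &src.hi, sizeof src.hi);
}
```

The legacy form updates its destination in place, while the three-operand form builds a fresh result from the first source and memory, which is why the model returns a new value for it.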

Intel C/C++ Compiler Intrinsic Equivalent 

MOVHPD __m128d _mm_loadh_pd (__m128d a, double *p)

MOVHPD void _mm_storeh_pd (double *p, __m128d a)

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 5; additionally 
#UD If VEX.L = 1.

EVEX-encoded instruction, see Exceptions Type E9NF. 
MOVHPS—Move High Packed Single-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F 16 /r MOVHPS xmm1, m64 | RM | V/V | SSE | Move two packed single-precision floating-point values from m64 to high quadword of xmm1.
VEX.NDS.128.0F.WIG 16 /r VMOVHPS xmm2, xmm1, m64 | RVM | V/V | AVX | Merge two packed single-precision floating-point values from m64 and the low quadword of xmm1.
EVEX.NDS.128.0F.W0 16 /r VMOVHPS xmm2, xmm1, m64 | T2 | V/V | AVX512F | Merge two packed single-precision floating-point values from m64 and the low quadword of xmm1.
0F 17 /r MOVHPS m64, xmm1 | MR | V/V | SSE | Move two packed single-precision floating-point values from high quadword of xmm1 to m64.
VEX.128.0F.WIG 17 /r VMOVHPS m64, xmm1 | MR | V/V | AVX | Move two packed single-precision floating-point values from high quadword of xmm1 to m64.
EVEX.128.0F.W0 17 /r VMOVHPS m64, xmm1 | T2-MR | V/V | AVX512F | Move two packed single-precision floating-point values from high quadword of xmm1 to m64.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA
MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
T2 | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA
T2-MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA


Description 

This instruction cannot be used for register to register or memory to memory moves. 

128-bit Legacy SSE load: 

Moves two packed single-precision floating-point values from the source 64-bit memory operand and stores them
in the high 64 bits of the destination XMM register. The lower 64 bits of the XMM register are preserved. Bits
(MAX_VL-1:128) of the corresponding destination register are preserved.

VEX.128 & EVEX encoded load:

Loads two single-precision floating-point values from the source 64-bit memory operand (the third operand) and
stores them in the upper 64 bits of the destination XMM register (first operand). The low 64 bits from the first source
operand (the second operand) are copied to the lower 64 bits of the destination. Bits (MAX_VL-1:128) of the
corresponding destination register are zeroed.

128-bit store:

Stores two packed single-precision floating-point values from the high 64 bits of the XMM register source (second
operand) to the 64-bit memory location (first operand).

Note: VMOVHPS (store) (VEX.NDS.128.0F 17 /r) is legal and has the same behavior as the existing 0F 17 store.
For VMOVHPS (store), VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise the instruction will #UD.

An attempt to execute VMOVHPS encoded with VEX.L or EVEX.L'L = 1 will cause a #UD exception.
Operation

MOVHPS (128-bit Legacy SSE load)
DEST[63:0] (Unmodified)
DEST[127:64] ← SRC[63:0]
DEST[MAX_VL-1:128] (Unmodified)

VMOVHPS (VEX.128 and EVEX encoded load)
DEST[63:0] ← SRC1[63:0]
DEST[127:64] ← SRC2[63:0]
DEST[MAX_VL-1:128] ← 0

VMOVHPS (store)
DEST[63:0] ← SRC[127:64]

Intel C/C++ Compiler Intrinsic Equivalent 

MOVHPS __m128 _mm_loadh_pi (__m128 a, __m64 *p)

MOVHPS void _mm_storeh_pi (__m64 *p, __m128 a)

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 5; additionally 
#UD If VEX.L = 1.

EVEX-encoded instruction, see Exceptions Type E9NF. 
MOVLHPS—Move Packed Single-Precision Floating-Point Values Low to High

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F 16 /r MOVLHPS xmm1, xmm2 | RM | V/V | SSE | Move two packed single-precision floating-point values from low quadword of xmm2 to high quadword of xmm1.
VEX.NDS.128.0F.WIG 16 /r VMOVLHPS xmm1, xmm2, xmm3 | RVM | V/V | AVX | Merge two packed single-precision floating-point values from low quadword of xmm3 and low quadword of xmm2.
EVEX.NDS.128.0F.W0 16 /r VMOVLHPS xmm1, xmm2, xmm3 | RVM | V/V | AVX512F | Merge two packed single-precision floating-point values from low quadword of xmm3 and low quadword of xmm2.


Instruction Operand Encoding¹

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | vvvv (r) | ModRM:r/m (r) | NA


Description 

This instruction cannot be used for memory to register moves. 

128-bit two-argument form: 

Moves two packed single-precision floating-point values from the low quadword of the second XMM argument
(second operand) to the high quadword of the first XMM register (first argument). The low quadword of the
destination operand is left unchanged. Bits (MAX_VL-1:128) of the corresponding destination register are unmodified.

128-bit three-argument forms:

Moves two packed single-precision floating-point values from the low quadword of the third XMM argument (third
operand) to the high quadword of the destination (first operand). Copies the low quadword from the second XMM
argument (second operand) to the low quadword of the destination (first operand). Bits (MAX_VL-1:128) of the
corresponding destination register are zeroed.

An attempt to execute VMOVLHPS encoded with VEX.L or EVEX.L'L = 1 will cause a #UD exception.

Operation 

MOVLHPS (128-bit two-argument form)
DEST[63:0] (Unmodified)
DEST[127:64] ← SRC[63:0]
DEST[MAX_VL-1:128] (Unmodified)

VMOVLHPS (128-bit three-argument form - VEX & EVEX)
DEST[63:0] ← SRC1[63:0]
DEST[127:64] ← SRC2[63:0]
DEST[MAX_VL-1:128] ← 0
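
As a rough illustration of the two MOVLHPS forms, the sketch below models a 128-bit register as four `float` lanes, with `f[0]` holding bits 31:0. The `xmm_ps` type and both function names are hypothetical, and the zeroing of bits above 127 in the VEX/EVEX form is outside what this model represents.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of bits 127:0: f[0] = bits 31:0 ... f[3] = bits 127:96. */
typedef struct { float f[4]; } xmm_ps;

/* MOVLHPS xmm1, xmm2: DEST[127:64] <- SRC[63:0]; DEST[63:0] is unchanged. */
void movlhps(xmm_ps *dest, const xmm_ps *src) {
    assert(dest != NULL && src != NULL);
    float low[2];
    memcpy(low, &src->f[0], sizeof low);      /* read first, so dest == src works */
    memcpy(&dest->f[2], low, sizeof low);
}

/* VMOVLHPS xmm1, xmm2, xmm3: low half from SRC1, high half from SRC2's low half. */
xmm_ps vmovlhps(xmm_ps src1, xmm_ps src2) {
    xmm_ps dest;
    memcpy(&dest.f[0], &src1.f[0], 2 * sizeof(float));
    memcpy(&dest.f[2], &src2.f[0], 2 * sizeof(float));
    return dest;
}
```

With `a = {1, 2, 3, 4}` and `b = {5, 6, 7, 8}`, both forms produce `{1, 2, 5, 6}`; the difference is that the two-operand form overwrites `a` while the three-operand form leaves both sources intact.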

Intel C/C++ Compiler Intrinsic Equivalent

MOVLHPS __m128 _mm_movelh_ps(__m128 a, __m128 b)

SIMD Floating-Point Exceptions 

None 


1. ModRM.MOD = 011B required 
Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 7; additionally
#UD If VEX.L = 1.

EVEX-encoded instruction, see Exceptions Type E7NM.128. 


MOVLPD—Move Low Packed Double-Precision Floating-Point Value

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 12 /r MOVLPD xmm1, m64 | RM | V/V | SSE2 | Move double-precision floating-point value from m64 to low quadword of xmm1.
VEX.NDS.128.66.0F.WIG 12 /r VMOVLPD xmm2, xmm1, m64 | RVM | V/V | AVX | Merge double-precision floating-point value from m64 and the high quadword of xmm1.
EVEX.NDS.128.66.0F.W1 12 /r VMOVLPD xmm2, xmm1, m64 | T1S | V/V | AVX512F | Merge double-precision floating-point value from m64 and the high quadword of xmm1.
66 0F 13 /r MOVLPD m64, xmm1 | MR | V/V | SSE2 | Move double-precision floating-point value from low quadword of xmm1 to m64.
VEX.128.66.0F.WIG 13 /r VMOVLPD m64, xmm1 | MR | V/V | AVX | Move double-precision floating-point value from low quadword of xmm1 to m64.
EVEX.128.66.0F.W1 13 /r VMOVLPD m64, xmm1 | T1S-MR | V/V | AVX512F | Move double-precision floating-point value from low quadword of xmm1 to m64.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA
MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
T1S | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA
T1S-MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA


Description 

This instruction cannot be used for register to register or memory to memory moves. 

128-bit Legacy SSE load: 

Moves a double-precision floating-point value from the source 64-bit memory operand and stores it in the low 64
bits of the destination XMM register. The upper 64 bits of the XMM register are preserved. Bits (MAX_VL-1:128) of
the corresponding destination register are preserved.

VEX.128 & EVEX encoded load:

Loads a double-precision floating-point value from the source 64-bit memory operand (third operand), merges it
with the upper 64 bits of the first source XMM register (second operand), and stores it in the low 128 bits of the
destination XMM register (first operand). Bits (MAX_VL-1:128) of the corresponding destination register are
zeroed.

128-bit store:

Stores a double-precision floating-point value from the low 64 bits of the XMM register source (second operand) to
the 64-bit memory location (first operand).

Note: VMOVLPD (store) (VEX.128.66.0F 13 /r) is legal and has the same behavior as the existing 66 0F 13 store.
For VMOVLPD (store), VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise the instruction will #UD.

An attempt to execute VMOVLPD encoded with VEX.L or EVEX.L'L = 1 will cause a #UD exception.

Operation 

MOVLPD (128-bit Legacy SSE load)
DEST[63:0] ← SRC[63:0]
DEST[MAX_VL-1:64] (Unmodified)

VMOVLPD (VEX.128 & EVEX encoded load)
DEST[63:0] ← SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[MAX_VL-1:128] ← 0

VMOVLPD (store)
DEST[63:0] ← SRC[63:0]
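
The practical difference between the legacy and VEX/EVEX loads is what happens above bit 63 (legacy) or bit 127 (VEX/EVEX). The sketch below makes that visible by modeling a full-width register as eight quadwords, which assumes MAX_VL = 512; the `vreg_model` type and the function names are hypothetical, and the memory operand is passed as a raw 64-bit pattern for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VL_QWORDS 8   /* assumption: MAX_VL = 512 bits */

/* Hypothetical full-width register model: q[0] = bits 63:0, q[7] = bits 511:448. */
typedef struct { uint64_t q[MAX_VL_QWORDS]; } vreg_model;

/* MOVLPD xmm1, m64 (legacy): DEST[63:0] <- m64; DEST[MAX_VL-1:64] unmodified. */
void movlpd_load(vreg_model *dest, uint64_t m64) {
    assert(dest != NULL);
    dest->q[0] = m64;
}

/* VMOVLPD xmm2, xmm1, m64: merge m64 with SRC1[127:64], zero bits MAX_VL-1:128. */
void vmovlpd_load(vreg_model *dest, const vreg_model *src1, uint64_t m64) {
    assert(dest != NULL && src1 != NULL);
    uint64_t high = src1->q[1];          /* read first: dest may alias src1 */
    dest->q[0] = m64;
    dest->q[1] = high;
    for (int i = 2; i < MAX_VL_QWORDS; i++) {
        dest->q[i] = 0;
    }
}

/* (V)MOVLPD m64, xmm1 (store): m64 <- SRC[63:0]. */
uint64_t movlpd_store(const vreg_model *src) {
    assert(src != NULL);
    return src->q[0];
}
```

After `movlpd_load`, every quadword except `q[0]` keeps its previous contents; after `vmovlpd_load`, quadwords 2 through 7 are zero regardless of what the destination held before.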

Intel C/C++ Compiler Intrinsic Equivalent 

MOVLPD __m128d _mm_loadl_pd (__m128d a, double *p)

MOVLPD void _mm_storel_pd (double *p, __m128d a)

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 5; additionally 
#UD If VEX.L = 1.

EVEX-encoded instruction, see Exceptions Type E9NF. 


MOVLPS—Move Low Packed Single-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F 12 /r MOVLPS xmm1, m64 | RM | V/V | SSE | Move two packed single-precision floating-point values from m64 to low quadword of xmm1.
VEX.NDS.128.0F.WIG 12 /r VMOVLPS xmm2, xmm1, m64 | RVM | V/V | AVX | Merge two packed single-precision floating-point values from m64 and the high quadword of xmm1.
EVEX.NDS.128.0F.W0 12 /r VMOVLPS xmm2, xmm1, m64 | T2 | V/V | AVX512F | Merge two packed single-precision floating-point values from m64 and the high quadword of xmm1.
0F 13 /r MOVLPS m64, xmm1 | MR | V/V | SSE | Move two packed single-precision floating-point values from low quadword of xmm1 to m64.
VEX.128.0F.WIG 13 /r VMOVLPS m64, xmm1 | MR | V/V | AVX | Move two packed single-precision floating-point values from low quadword of xmm1 to m64.
EVEX.128.0F.W0 13 /r VMOVLPS m64, xmm1 | T2-MR | V/V | AVX512F | Move two packed single-precision floating-point values from low quadword of xmm1 to m64.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA
MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
T2 | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA
T2-MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA


Description 

This instruction cannot be used for register to register or memory to memory moves. 

128-bit Legacy SSE load: 

Moves two packed single-precision floating-point values from the source 64-bit memory operand and stores them
in the low 64 bits of the destination XMM register. The upper 64 bits of the XMM register are preserved. Bits
(MAX_VL-1:128) of the corresponding destination register are preserved.

VEX.128 & EVEX encoded load:

Loads two packed single-precision floating-point values from the source 64-bit memory operand (the third
operand), merges them with the upper 64 bits of the first source operand (the second operand), and stores them
in the low 128 bits of the destination register (the first operand). Bits (MAX_VL-1:128) of the corresponding
destination register are zeroed.

128-bit store:

Stores two packed single-precision floating-point values from the low 64 bits of the XMM register source (second
operand) to the 64-bit memory location (first operand).

Note: VMOVLPS (store) (VEX.128.0F 13 /r) is legal and has the same behavior as the existing 0F 13 store. For
VMOVLPS (store), VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise the instruction will #UD.

An attempt to execute VMOVLPS encoded with VEX.L or EVEX.L'L = 1 will cause a #UD exception.
Operation

MOVLPS (128-bit Legacy SSE load)
DEST[63:0] ← SRC[63:0]
DEST[MAX_VL-1:64] (Unmodified)

VMOVLPS (VEX.128 & EVEX encoded load)
DEST[63:0] ← SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[MAX_VL-1:128] ← 0

VMOVLPS (store)
DEST[63:0] ← SRC[63:0]

Intel C/C++ Compiler Intrinsic Equivalent 

MOVLPS __m128 _mm_loadl_pi (__m128 a, __m64 *p)

MOVLPS void _mm_storel_pi (__m64 *p, __m128 a)

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 5; additionally 
#UD If VEX.L = 1.

EVEX-encoded instruction, see Exceptions Type E9NF. 


MOVMSKPD—Extract Packed Double-Precision Floating-Point Sign Mask

Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
66 0F 50 /r MOVMSKPD reg, xmm | RM | V/V | SSE2 | Extract 2-bit sign mask from xmm and store in reg. The upper bits of r32 or r64 are filled with zeros.
VEX.128.66.0F.WIG 50 /r VMOVMSKPD reg, xmm2 | RM | V/V | AVX | Extract 2-bit sign mask from xmm2 and store in reg. The upper bits of r32 or r64 are zeroed.
VEX.256.66.0F.WIG 50 /r VMOVMSKPD reg, ymm2 | RM | V/V | AVX | Extract 4-bit sign mask from ymm2 and store in reg. The upper bits of r32 or r64 are zeroed.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA


Description 

Extracts the sign bits from the packed double-precision floating-point values in the source operand (second
operand), formats them into a 2-bit mask, and stores the mask in the destination operand (first operand). The
source operand is an XMM register, and the destination operand is a general-purpose register. The mask is stored
in the 2 low-order bits of the destination operand. The upper bits of the destination are filled with zeros.

In 64-bit mode, the instruction can access additional registers (XMM8-XMM15, R8-R15) when used with a REX.R
prefix. The default operand size is 64-bit in 64-bit mode.

128-bit versions: The source operand is a YMM register. The destination operand is a general purpose register.

VEX.256 encoded version: The source operand is a YMM register. The destination operand is a general purpose
register.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD. 

Operation 

(V)MOVMSKPD (128-bit versions)
DEST[0] ← SRC[63]
DEST[1] ← SRC[127]
IF DEST = r32
    THEN DEST[31:2] ← 0;
    ELSE DEST[63:2] ← 0;
FI

VMOVMSKPD (VEX.256 encoded version)
DEST[0] ← SRC[63]
DEST[1] ← SRC[127]
DEST[2] ← SRC[191]
DEST[3] ← SRC[255]
IF DEST = r32
    THEN DEST[31:4] ← 0;
    ELSE DEST[63:4] ← 0;
FI
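
The operation above can be sketched as a scalar C loop over the packed doubles. This is a model for illustration only: `movmskpd_model` is a hypothetical name, the source register is represented as an array with element 0 in the lowest bits, and a 64-bit IEEE 754 `double` is assumed so that bit 63 of each element is its sign bit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of (V)MOVMSKPD. n = 2 models the 128-bit versions,
 * n = 4 models the VEX.256 version. The result is already zero-extended. */
uint32_t movmskpd_model(const double *src, int n) {
    assert(src != NULL && (n == 2 || n == 4));
    uint32_t mask = 0;
    for (int i = 0; i < n; i++) {
        uint64_t bits;
        memcpy(&bits, &src[i], sizeof bits);      /* raw bit pattern of lane i */
        mask |= (uint32_t)(bits >> 63) << i;      /* DEST[i] <- sign bit of lane i */
    }
    return mask;
}
```

Because only the sign bit is inspected, a negative zero contributes a 1 to the mask even though it compares equal to zero, so the mask is not a substitute for a "less than zero" comparison.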
Intel C/C++ Compiler Intrinsic Equivalent

MOVMSKPD: int _mm_movemask_pd (__m128d a)

VMOVMSKPD: int _mm256_movemask_pd(__m256d a)

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

See Exceptions Type 7; additionally 
#UD If VEX.vvvv ≠ 1111B.
MOVMSKPS—Extract Packed Single-Precision Floating-Point Sign Mask

Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
0F 50 /r MOVMSKPS reg, xmm | RM | V/V | SSE | Extract 4-bit sign mask from xmm and store in reg. The upper bits of r32 or r64 are filled with zeros.
VEX.128.0F.WIG 50 /r VMOVMSKPS reg, xmm2 | RM | V/V | AVX | Extract 4-bit sign mask from xmm2 and store in reg. The upper bits of r32 or r64 are zeroed.
VEX.256.0F.WIG 50 /r VMOVMSKPS reg, ymm2 | RM | V/V | AVX | Extract 8-bit sign mask from ymm2 and store in reg. The upper bits of r32 or r64 are zeroed.


Instruction Operand Encoding¹

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA


Description 

Extracts the sign bits from the packed single-precision floating-point values in the source operand (second
operand), formats them into a 4- or 8-bit mask, and stores the mask in the destination operand (first operand).
The source operand is an XMM or YMM register, and the destination operand is a general-purpose register. The
mask is stored in the 4 or 8 low-order bits of the destination operand. The upper bits of the destination operand
beyond the mask are filled with zeros.

In 64-bit mode, the instruction can access additional registers (XMM8-XMM15, R8-R15) when used with a REX.R
prefix. The default operand size is 64-bit in 64-bit mode.

128-bit versions: The source operand is a YMM register. The destination operand is a general purpose register.

VEX.256 encoded version: The source operand is a YMM register. The destination operand is a general purpose
register.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD. 

Operation 

DEST[0] ← SRC[31];
DEST[1] ← SRC[63];
DEST[2] ← SRC[95];
DEST[3] ← SRC[127];
IF DEST = r32
    THEN DEST[31:4] ← ZeroExtend;
    ELSE DEST[63:4] ← ZeroExtend;
FI;


1. ModRM.MOD = 011B required
(V)MOVMSKPS (128-bit version)
DEST[0] ← SRC[31]
DEST[1] ← SRC[63]
DEST[2] ← SRC[95]
DEST[3] ← SRC[127]
IF DEST = r32
    THEN DEST[31:4] ← 0;
    ELSE DEST[63:4] ← 0;
FI

VMOVMSKPS (VEX.256 encoded version)
DEST[0] ← SRC[31]
DEST[1] ← SRC[63]
DEST[2] ← SRC[95]
DEST[3] ← SRC[127]
DEST[4] ← SRC[159]
DEST[5] ← SRC[191]
DEST[6] ← SRC[223]
DEST[7] ← SRC[255]
IF DEST = r32
    THEN DEST[31:8] ← 0;
    ELSE DEST[63:8] ← 0;
FI
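
Since the operation only copies bit 31 of each 32-bit lane, it can be modeled directly on raw lane bit patterns without interpreting them as floating-point numbers. The following is a hedged sketch under that view; `movmskps_model` is a hypothetical name and the lanes are passed as an array with lane 0 holding bits 31:0 of the source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of (V)MOVMSKPS on raw 32-bit lane patterns.
 * n = 4 models the 128-bit version, n = 8 the VEX.256 version. */
uint32_t movmskps_model(const uint32_t *lanes, int n) {
    assert(lanes != NULL && (n == 4 || n == 8));
    uint32_t mask = 0;
    for (int i = 0; i < n; i++) {
        mask |= (lanes[i] >> 31) << i;   /* DEST[i] <- bit 31 of lane i */
    }
    return mask;                          /* bits above n-1 stay zero */
}
```

For example, the lanes 0x80000000 (negative zero), 0x3F800000 (1.0), 0xBF800000 (-1.0) and 0xFFC00000 (a NaN with its sign bit set) give the mask 1101b. Lanes that are all ones or all zeros, such as the results of a packed compare, map to one mask bit each.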

Intel C/C++ Compiler Intrinsic Equivalent 

int _mm_movemask_ps(__m128 a)

int _mm256_movemask_ps(__m256 a)

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

See Exceptions Type 7; additionally 
#UD If VEX.vvvv ≠ 1111B.


MOVNTDQA—Load Double Quadword Non-Temporal Aligned Hint

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 38 2A /r MOVNTDQA xmm1, m128 | RM | V/V | SSE4_1 | Move double quadword from m128 to xmm1 using non-temporal hint if WC memory type.
VEX.128.66.0F38.WIG 2A /r VMOVNTDQA xmm1, m128 | RM | V/V | AVX | Move double quadword from m128 to xmm using non-temporal hint if WC memory type.
VEX.256.66.0F38.WIG 2A /r VMOVNTDQA ymm1, m256 | RM | V/V | AVX2 | Move 256-bit data from m256 to ymm using non-temporal hint if WC memory type.
EVEX.128.66.0F38.W0 2A /r VMOVNTDQA xmm1, m128 | FVM | V/V | AVX512VL AVX512F | Move 128-bit data from m128 to xmm using non-temporal hint if WC memory type.
EVEX.256.66.0F38.W0 2A /r VMOVNTDQA ymm1, m256 | FVM | V/V | AVX512VL AVX512F | Move 256-bit data from m256 to ymm using non-temporal hint if WC memory type.
EVEX.512.66.0F38.W0 2A /r VMOVNTDQA zmm1, m512 | FVM | V/V | AVX512F | Move 512-bit data from m512 to zmm using non-temporal hint if WC memory type.


Instruction Operand Encoding¹

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
FVM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA


Description 

MOVNTDQA loads a double quadword from the source operand (second operand) to the destination operand (first 
operand) using a non-temporal hint if the memory source is WC (write combining) memory type. For WC memory 
type, the nontemporal hint may be implemented by loading a temporary internal buffer with the equivalent of an 
aligned cache line without filling this data to the cache. Any memory-type aliased lines in the cache will be snooped 
and flushed. Subsequent MOVNTDQA reads to unread portions of the WC cache line will receive data from the 
temporary internal buffer if data is available. The temporary internal buffer may be flushed by the processor at any 
time for any reason, for example: 

• A load operation other than a MOVNTDQA which references memory already resident in a temporary internal 
buffer. 

• A non-WC reference to memory already resident in a temporary internal buffer. 

• Interleaving of reads and writes to a single temporary internal buffer. 

• Repeated (V)MOVNTDQA loads of a particular 16-byte item in a streaming line. 

• Certain micro-architectural conditions including resource shortages, detection of 
a mis-speculation condition, and various fault conditions 

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when reading the
data from memory. Using this protocol, the processor does not read the data into the cache hierarchy, nor does it
fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being
read can override the non-temporal hint, if the memory address specified for the non-temporal read is not a WC
memory region. Information on non-temporal reads and writes can be found in "Caching of Temporal vs.
Non-Temporal Data" in Chapter 10 in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with
an MFENCE instruction should be used in conjunction with MOVNTDQA instructions if multiple processors might use
different memory types for the referenced memory locations or to synchronize reads of a processor with writes by
other agents in the system. A processor's implementation of the streaming load hint does not override the effective
memory type, but the implementation of the hint is processor dependent. For example, a processor
implementation may choose to ignore the hint and process the instruction as a normal MOVDQA for any memory type.
Alternatively, another implementation may optimize cache reads generated by MOVNTDQA on WB memory type to
reduce cache evictions.

1. ModRM.MOD = 011B required

The 128-bit (V)MOVNTDQA addresses must be 16-byte aligned or the instruction will cause a #GP. 

The 256-bit VMOVNTDQA addresses must be 32-byte aligned or the instruction will cause a #GP. 

The 512-bit VMOVNTDQA addresses must be 64-byte aligned or the instruction will cause a #GP. 
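
The three alignment rules above reduce to "the address must be a multiple of the vector size in bytes". A minimal sketch of that check follows; the helper names are hypothetical and the functions only test an address value, they do not perform any load.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Required alignment in bytes for a (V)MOVNTDQA memory operand:
 * 16 for the 128-bit forms, 32 for the 256-bit forms, 64 for the 512-bit form. */
size_t movntdqa_required_alignment(unsigned vector_bits) {
    assert(vector_bits == 128 || vector_bits == 256 || vector_bits == 512);
    return vector_bits / 8;
}

/* True if addr satisfies the alignment rule for the given vector length;
 * false corresponds to the case where the instruction would raise #GP. */
bool movntdqa_address_ok(uintptr_t addr, unsigned vector_bits) {
    size_t align = movntdqa_required_alignment(vector_bits);
    return (addr & (align - 1)) == 0;    /* align is a power of two */
}
```

An address such as 0x1020 is acceptable for the 128-bit and 256-bit forms but not for the 512-bit form, since it is a multiple of 32 but not of 64.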

Operation 

MOVNTDQA (128-bit Legacy SSE form)
DEST ← SRC
DEST[MAX_VL-1:128] (Unmodified)

VMOVNTDQA (VEX.128 and EVEX.128 encoded form)
DEST ← SRC
DEST[MAX_VL-1:128] ← 0

VMOVNTDQA (VEX.256 and EVEX.256 encoded forms)
DEST[255:0] ← SRC[255:0]
DEST[MAX_VL-1:256] ← 0

VMOVNTDQA (EVEX.512 encoded form)
DEST[511:0] ← SRC[511:0]

Intel C/C++ Compiler Intrinsic Equivalent 

VMOVNTDQA __m512i _mm512_stream_load_si512(void * p);

MOVNTDQA __m128i _mm_stream_load_si128 (__m128i *p);

VMOVNTDQA __m256i _mm256_stream_load_si256 (__m256i *p);

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 1;

EVEX-encoded instruction, see Exceptions Type E1NF.

#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.


MOVNTDQ—Store Packed Integers Using Non-Temporal Hint

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F E7 /r MOVNTDQ m128, xmm1 | MR | V/V | SSE2 | Move packed integer values in xmm1 to m128 using non-temporal hint.
VEX.128.66.0F.WIG E7 /r VMOVNTDQ m128, xmm1 | MR | V/V | AVX | Move packed integer values in xmm1 to m128 using non-temporal hint.
VEX.256.66.0F.WIG E7 /r VMOVNTDQ m256, ymm1 | MR | V/V | AVX | Move packed integer values in ymm1 to m256 using non-temporal hint.
EVEX.128.66.0F.W0 E7 /r VMOVNTDQ m128, xmm1 | FVM | V/V | AVX512VL AVX512F | Move packed integer values in xmm1 to m128 using non-temporal hint.
EVEX.256.66.0F.W0 E7 /r VMOVNTDQ m256, ymm1 | FVM | V/V | AVX512VL AVX512F | Move packed integer values in ymm1 to m256 using non-temporal hint.
EVEX.512.66.0F.W0 E7 /r VMOVNTDQ m512, zmm1 | FVM | V/V | AVX512F | Move packed integer values in zmm1 to m512 using non-temporal hint.



Instruction Operand Encoding¹

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
FVM | ModRM:r/m (w) | ModRM:reg (r) | NA | NA


Description 

Moves the packed integers in the source operand (second operand) to the destination operand (first operand) using
a non-temporal hint to prevent caching of the data during the write to memory. The source operand is an XMM
register, YMM register or ZMM register, which is assumed to contain integer data (packed bytes, words,
doublewords, or quadwords). The destination operand is a 128-bit, 256-bit or 512-bit memory location. The memory
operand must be aligned on a 16-byte (128-bit version), 32-byte (VEX.256 encoded version) or 64-byte (512-bit
version) boundary; otherwise a general-protection exception (#GP) will be generated.

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the 
data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it 
fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being 
written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an 
uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see 
"Caching of Temporal vs. Non-Temporal Data" in Chapter 10 in the IA-32 Intel Architecture Software Developer's 
Manual, Volume 1. 

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with
the SFENCE or MFENCE instruction should be used in conjunction with VMOVNTDQ instructions if multiple
processors might use different memory types to read/write the destination memory locations.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, VEX.L must be 0; otherwise instructions will 
#UD. 

Operation 

VMOVNTDQ (EVEX encoded versions)
VL = 128, 256, 512
DEST[VL-1:0] ← SRC[VL-1:0]
DEST[MAX_VL-1:VL] ← 0


1. ModRM.MOD = 011B required 
MOVNTDQ (Legacy and VEX versions)
DEST ← SRC

Intel C/C++ Compiler Intrinsic Equivalent 

VMOVNTDQ void _mm512_stream_si512(void * p, __m512i a);

VMOVNTDQ void _mm256_stream_si256 (__m256i * p, __m256i a);

MOVNTDQ void _mm_stream_si128 (__m128i * p, __m128i a);

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 1.SSE2;
EVEX-encoded instruction, see Exceptions Type E1NF.

#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.


MOVNTI—Store Doubleword Using Non-Temporal Hint

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
0F C3 /r | MOVNTI m32, r32 | MR | Valid | Valid | Move doubleword from r32 to m32 using non-temporal hint.
REX.W + 0F C3 /r | MOVNTI m64, r64 | MR | Valid | N.E. | Move quadword from r64 to m64 using non-temporal hint.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA


Description 

Moves the doubleword integer in the source operand (second operand) to the destination operand (first operand) 
using a non-temporal hint to minimize cache pollution during the write to memory. The source operand is a 
general-purpose register. The destination operand is a 32-bit memory location. 

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the 
data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it 
fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being 
written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an 
uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see 
"Caching of Temporal vs. Non-Temporal Data" in Chapter 10 in the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 1. 

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with 
the SFENCE or MFENCE instruction should be used in conjunction with MOVNTI instructions if multiple processors 
might use different memory types to read/write the destination memory locations. 

In 64-bit mode, the instruction's default operation size is 32 bits. Use of the REX.R prefix permits access to
additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the
beginning of this section for encoding data and limits.
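
To make the role of the REX bits concrete, the sketch below assembles the bytes of MOVNTI for the simplest addressing form, a plain base register with no displacement, in 64-bit mode. It is an illustrative encoder under stated assumptions, not a complete assembler: `encode_movnti` is a hypothetical helper, registers are given by their hardware numbers (0 = rAX, 1 = rCX, 2 = rDX, 3 = rBX, 6 = rSI, 7 = rDI, 8 to 15 = R8 to R15), and base registers whose low three bits are 100b or 101b are rejected because those ModRM values select a SIB byte or a displacement-only form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode MOVNTI [base], src (mod = 00, no SIB, no displacement) for 64-bit mode.
 * is64 != 0 selects the REX.W form (m64, r64). Returns the number of bytes written. */
size_t encode_movnti(uint8_t out[4], unsigned base, unsigned src, int is64) {
    assert(base < 16 && src < 16);
    assert((base & 7) != 4 && (base & 7) != 5);   /* would need SIB or disp32 */

    size_t n = 0;
    uint8_t rex = (uint8_t)(0x40
                  | (is64 ? 0x08 : 0x00)          /* REX.W: 64-bit operation  */
                  | ((src >> 3) << 2)             /* REX.R: extends ModRM.reg */
                  | (base >> 3));                 /* REX.B: extends ModRM.r/m */
    if (rex != 0x40) {
        out[n++] = rex;                           /* prefix only when a bit is set */
    }
    out[n++] = 0x0F;
    out[n++] = 0xC3;
    out[n++] = (uint8_t)(((src & 7) << 3) | (base & 7));   /* ModRM, mod = 00 */
    return n;
}
```

With these assumptions, storing EAX through RDI encodes as 0F C3 07, the same store of RAX becomes 48 0F C3 07, and storing R8D through RDI needs REX.R and encodes as 44 0F C3 07.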

Operation 

DEST ← SRC;

Intel C/C++ Compiler Intrinsic Equivalent 

MOVNTI: void _mm_stream_si32 (int *p, int a)

MOVNTI: void _mm_stream_si64(__int64 *p, __int64 a)

SIMD Floating-Point Exceptions 

None. 


Protected Mode Exceptions 

#GP(0) For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments. 

#SS(0) For an illegal address in the SS segment. 

#PF(fault-code) For a page fault. 

#UD If CPUID.01H:EDX.SSE2[bit 26] = 0. 


If the LOCK prefix is used. 
Real-Address Mode Exceptions

#GP If any part of the operand lies outside the effective address space from 0 to FFFFH. 

#UD If CPUID.01H:EDX.SSE2[bit 26] = 0. 

If the LOCK prefix is used. 

Virtual-8086 Mode Exceptions

Same exceptions as in real address mode. 

#PF(fault-code) For a page fault. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 

#GP(0) If the memory address is in a non-canonical form. 

#PF(fault-code) For a page fault. 

#UD If CPUID.01H:EDX.SSE2[bit 26] = 0. 

If the LOCK prefix is used. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
current privilege level is 3.


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MOVNTPD—Store Packed Double-Precision Floating-Point Values Using Non-Temporal Hint 


| Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description |
|---|---|---|---|---|
| 66 0F 2B /r MOVNTPD m128, xmm1 | MR | V/V | SSE2 | Move packed double-precision values in xmm1 to m128 using non-temporal hint. |
| VEX.128.66.0F.WIG 2B /r VMOVNTPD m128, xmm1 | MR | V/V | AVX | Move packed double-precision values in xmm1 to m128 using non-temporal hint. |
| VEX.256.66.0F.WIG 2B /r VMOVNTPD m256, ymm1 | MR | V/V | AVX | Move packed double-precision values in ymm1 to m256 using non-temporal hint. |
| EVEX.128.66.0F.W1 2B /r VMOVNTPD m128, xmm1 | FVM | V/V | AVX512VL AVX512F | Move packed double-precision values in xmm1 to m128 using non-temporal hint. |
| EVEX.256.66.0F.W1 2B /r VMOVNTPD m256, ymm1 | FVM | V/V | AVX512VL AVX512F | Move packed double-precision values in ymm1 to m256 using non-temporal hint. |
| EVEX.512.66.0F.W1 2B /r VMOVNTPD m512, zmm1 | FVM | V/V | AVX512F | Move packed double-precision values in zmm1 to m512 using non-temporal hint. |


Instruction Operand Encoding¹

| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA |
| FVM | ModRM:r/m (w) | ModRM:reg (r) | NA | NA |


Description 

Moves the packed double-precision floating-point values in the source operand (second operand) to the destination
operand (first operand) using a non-temporal hint to prevent caching of the data during the write to memory. The
source operand is an XMM register, YMM register or ZMM register, which is assumed to contain packed
double-precision floating-point data. The destination operand is a 128-bit, 256-bit or 512-bit memory location. The
memory operand must be aligned on a 16-byte (128-bit version), 32-byte (VEX.256 encoded version) or 64-byte
(EVEX.512 encoded version) boundary; otherwise a general-protection exception (#GP) will be generated.
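
The alignment rule above can be checked in software before a non-temporal store is issued. The following is a minimal sketch in portable C; the function names `nt_store_alignment` and `nt_store_is_aligned` are hypothetical helpers introduced for illustration and are not part of any Intel API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Required alignment in bytes for a 128-, 256- or 512-bit non-temporal
   store, per the description above: 16, 32 or 64 bytes.
   Returns 0 for any other vector length. (Hypothetical helper.) */
static unsigned nt_store_alignment(unsigned vl_bits)
{
    switch (vl_bits) {
    case 128: return 16;
    case 256: return 32;
    case 512: return 64;
    default:  return 0;
    }
}

/* True when addr satisfies the alignment rule for the given vector
   length, i.e. no #GP is expected on alignment grounds alone. */
static bool nt_store_is_aligned(uintptr_t addr, unsigned vl_bits)
{
    unsigned align = nt_store_alignment(vl_bits);
    return align != 0 && (addr % align) == 0;
}
```

A caller would typically test the destination pointer once per buffer and fall back to an ordinary store path when the check fails.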

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the 
data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it 
fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being 
written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an 
uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see 
"Caching of Temporal vs. Non-Temporal Data" in Chapter 10 in the IA-32 Intel Architecture Software Developer's 
Manual, Volume 1. 

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with 
the SFENCE or MFENCE instruction should be used in conjunction with MOVNTPD instructions if multiple processors 
might use different memory types to read/write the destination memory locations. 

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, VEX.L must be 0; otherwise instructions will 
#UD. 

Operation 

VMOVNTPD (EVEX encoded versions) 

VL = 128, 256, 512
DEST[VL-1:0] ← SRC[VL-1:0]

DEST[MAX_VL-1:VL] ← 0


1. ModRM.MOD = 011B required

MOVNTPD (Legacy and VEX versions)

DEST ← SRC

Intel C/C++ Compiler Intrinsic Equivalent 

VMOVNTPD void _mm512_stream_pd(double * p, __m512d a);

VMOVNTPD void _mm256_stream_pd (double * p, __m256d a);

MOVNTPD void _mm_stream_pd (double * p, __m128d a);

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type1.SSE2;
EVEX-encoded instruction, see Exceptions Type E1NF.

#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.


MOVNTPS—Store Packed Single-Precision Floating-Point Values Using Non-Temporal Hint


| Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description |
|---|---|---|---|---|
| 0F 2B /r MOVNTPS m128, xmm1 | MR | V/V | SSE | Move packed single-precision values xmm1 to mem using non-temporal hint. |
| VEX.128.0F.WIG 2B /r VMOVNTPS m128, xmm1 | MR | V/V | AVX | Move packed single-precision values xmm1 to mem using non-temporal hint. |
| VEX.256.0F.WIG 2B /r VMOVNTPS m256, ymm1 | MR | V/V | AVX | Move packed single-precision values ymm1 to mem using non-temporal hint. |
| EVEX.128.0F.W0 2B /r VMOVNTPS m128, xmm1 | FVM | V/V | AVX512VL AVX512F | Move packed single-precision values in xmm1 to m128 using non-temporal hint. |
| EVEX.256.0F.W0 2B /r VMOVNTPS m256, ymm1 | FVM | V/V | AVX512VL AVX512F | Move packed single-precision values in ymm1 to m256 using non-temporal hint. |
| EVEX.512.0F.W0 2B /r VMOVNTPS m512, zmm1 | FVM | V/V | AVX512F | Move packed single-precision values in zmm1 to m512 using non-temporal hint. |



Instruction Operand Encoding¹

| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA |
| FVM | ModRM:r/m (w) | ModRM:reg (r) | NA | NA |


Description 

Moves the packed single-precision floating-point values in the source operand (second operand) to the destination
operand (first operand) using a non-temporal hint to prevent caching of the data during the write to memory. The
source operand is an XMM register, YMM register or ZMM register, which is assumed to contain packed
single-precision floating-point data. The destination operand is a 128-bit, 256-bit or 512-bit memory location. The memory
operand must be aligned on a 16-byte (128-bit version), 32-byte (VEX.256 encoded version) or 64-byte (EVEX.512
encoded version) boundary; otherwise a general-protection exception (#GP) will be generated.

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the 
data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it 
fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being 
written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an 
uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see 
"Caching of Temporal vs. Non-Temporal Data" in Chapter 10 in the IA-32 Intel Architecture Software Developer's 
Manual, Volume 1. 

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with 
the SFENCE or MFENCE instruction should be used in conjunction with MOVNTPS instructions if multiple processors 
might use different memory types to read/write the destination memory locations. 

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD. 

Operation 

VMOVNTPS (EVEX encoded versions) 

VL = 128, 256, 512
DEST[VL-1:0] ← SRC[VL-1:0]

DEST[MAX_VL-1:VL] ← 0


1. ModRM.MOD = 011B required

MOVNTPS

DEST ← SRC

Intel C/C++ Compiler Intrinsic Equivalent 

VMOVNTPS void _mm512_stream_ps(float * p, __m512 a);

MOVNTPS void _mm_stream_ps (float * p, __m128 a);

VMOVNTPS void _mm256_stream_ps (float * p, __m256 a);

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type1.SSE; additionally
EVEX-encoded instruction, see Exceptions Type E1NF.

#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.


MOVNTQ—Store of Quadword Using Non-Temporal Hint


| Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description |
|---|---|---|---|---|---|
| 0F E7 /r | MOVNTQ m64, mm | MR | Valid | Valid | Move quadword from mm to m64 using non-temporal hint. |


Instruction Operand Encoding 


| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA |


Description 

Moves the quadword in the source operand (second operand) to the destination operand (first operand) using a
non-temporal hint to minimize cache pollution during the write to memory. The source operand is an MMX
technology register, which is assumed to contain packed integer data (packed bytes, words, or doublewords). The
destination operand is a 64-bit memory location.

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the 
data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it 
fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being 
written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an 
uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see 
"Caching of Temporal vs. Non-Temporal Data" in Chapter 10 in the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 1. 

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with 
the SFENCE or MFENCE instruction should be used in conjunction with MOVNTQ instructions if multiple processors 
might use different memory types to read/write the destination memory locations. 

This instruction's operation is the same in non-64-bit modes and 64-bit mode. 

Operation 

DEST ← SRC;

Intel C/C++ Compiler Intrinsic Equivalent 

MOVNTQ: void _mm_stream_pi(__m64 * p, __m64 a)

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

See Table 22-8, "Exception Conditions for Legacy SIMD/MMX Instructions without FP Exception," in the Intel® 64
and IA-32 Architectures Software Developer's Manual, Volume 3A.




MOVQ—Move Quadword 


| Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description |
|---|---|---|---|---|
| 0F 6F /r MOVQ mm, mm/m64 | RM | V/V | MMX | Move quadword from mm/m64 to mm. |
| 0F 7F /r MOVQ mm/m64, mm | MR | V/V | MMX | Move quadword from mm to mm/m64. |
| F3 0F 7E /r MOVQ xmm1, xmm2/m64 | RM | V/V | SSE2 | Move quadword from xmm2/mem64 to xmm1. |
| VEX.128.F3.0F.WIG 7E /r VMOVQ xmm1, xmm2/m64 | RM | V/V | AVX | Move quadword from xmm2 to xmm1. |
| EVEX.128.F3.0F.W1 7E /r VMOVQ xmm1, xmm2/m64 | T1S-RM | V/V | AVX512F | Move quadword from xmm2/m64 to xmm1. |
| 66 0F D6 /r MOVQ xmm2/m64, xmm1 | MR | V/V | SSE2 | Move quadword from xmm1 to xmm2/mem64. |
| VEX.128.66.0F.WIG D6 /r VMOVQ xmm1/m64, xmm2 | MR | V/V | AVX | Move quadword from xmm2 register to xmm1/m64. |
| EVEX.128.66.0F.W1 D6 /r VMOVQ xmm1/m64, xmm2 | T1S-MR | V/V | AVX512F | Move quadword from xmm2 register to xmm1/m64. |


Instruction Operand Encoding 


| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA |
| MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA |
| T1S-RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA |
| T1S-MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA |


Description 

Copies a quadword from the source operand (second operand) to the destination operand (first operand). The
source and destination operands can be MMX technology registers, XMM registers, or 64-bit memory locations.
This instruction can be used to move a quadword between two MMX technology registers or between an MMX
technology register and a 64-bit memory location, or to move data between two XMM registers or between an XMM
register and a 64-bit memory location. The instruction cannot be used to transfer data between memory locations.

When the source operand is an XMM register, the low quadword is moved; when the destination operand is an XMM
register, the quadword is stored to the low quadword of the register, and the high quadword is cleared to all 0s.
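
The low-quadword and zero-extension behavior described above can be illustrated with a small software model. This is a sketch under stated assumptions: `xmm_model` and the three functions are hypothetical names invented for illustration, and the model covers only the low 128 bits of the legacy SSE2 forms.

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit XMM register as two quadwords. */
typedef struct {
    uint64_t lo;  /* bits 63:0   */
    uint64_t hi;  /* bits 127:64 */
} xmm_model;

/* MOVQ xmm1, xmm2: copy the low quadword, clear the high quadword. */
static xmm_model movq_xmm_from_xmm(xmm_model src)
{
    xmm_model dest = { src.lo, 0 };
    return dest;
}

/* MOVQ xmm1, m64: load 64 bits into the low quadword, clear the high one. */
static xmm_model movq_xmm_from_mem(const uint64_t *mem)
{
    xmm_model dest = { *mem, 0 };
    return dest;
}

/* MOVQ m64, xmm1: store only the low quadword to memory. */
static void movq_mem_from_xmm(uint64_t *mem, xmm_model src)
{
    *mem = src.lo;
}
```

Note that the register-to-register form discards the source's high quadword rather than preserving the destination's previous high quadword.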

In 64-bit mode and if not encoded using VEX/EVEX, use of the REX prefix in the form of REX.R permits this
instruction to access additional registers (XMM8-XMM15).

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

If VMOVQ is encoded with VEX.L = 1, an attempt to execute the instruction will cause a #UD exception.


Operation

MOVQ instruction when operating on MMX technology registers and memory locations:
DEST ← SRC;

MOVQ instruction when source and destination operands are XMM registers:
DEST[63:0] ← SRC[63:0];
DEST[127:64] ← 0000000000000000H;

MOVQ instruction when source operand is XMM register and destination
operand is memory location:
DEST ← SRC[63:0];

MOVQ instruction when source operand is memory location and destination
operand is XMM register:
DEST[63:0] ← SRC;
DEST[127:64] ← 0000000000000000H;

VMOVQ (VEX.NDS.128.F3.0F 7E) with XMM register source and destination:
DEST[63:0] ← SRC[63:0]
DEST[VLMAX-1:64] ← 0

VMOVQ (VEX.128.66.0F D6) with XMM register source and destination:
DEST[63:0] ← SRC[63:0]
DEST[VLMAX-1:64] ← 0

VMOVQ (7E - EVEX encoded version) with XMM register source and destination:
DEST[63:0] ← SRC[63:0]
DEST[MAX_VL-1:64] ← 0

VMOVQ (D6 - EVEX encoded version) with XMM register source and destination:
DEST[63:0] ← SRC[63:0]
DEST[MAX_VL-1:64] ← 0

VMOVQ (7E) with memory source:
DEST[63:0] ← SRC[63:0]
DEST[VLMAX-1:64] ← 0

VMOVQ (7E - EVEX encoded version) with memory source:
DEST[63:0] ← SRC[63:0]
DEST[MAX_VL-1:64] ← 0

VMOVQ (D6) with memory dest:
DEST[63:0] ← SRC2[63:0]

Flags Affected 

None. 

Intel C/C++ Compiler Intrinsic Equivalent 

VMOVQ __m128i _mm_loadu_si64( void * s);

VMOVQ void _mm_storeu_si64( void * d, __m128i s);

MOVQ __m128i _mm_move_epi64(__m128i a)

SIMD Floating-Point Exceptions

None 

Other Exceptions 

See Table 22-8, "Exception Conditions for Legacy SIMD/MMX Instructions without FP Exception," in the Intel® 64
and IA-32 Architectures Software Developer's Manual, Volume 3B.

MOVQ2DQ—Move Quadword from MMX Technology to XMM Register


| Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description |
|---|---|---|---|---|---|
| F3 0F D6 /r | MOVQ2DQ xmm, mm | RM | Valid | Valid | Move quadword from mmx to low quadword of xmm. |


Instruction Operand Encoding

| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA |


Description 

Moves the quadword from the source operand (second operand) to the low quadword of the destination operand 
(first operand). The source operand is an MMX technology register and the destination operand is an XMM register. 

This instruction causes a transition from x87 FPU to MMX technology operation (that is, the x87 FPU top-of-stack
pointer is set to 0 and the x87 FPU tag word is set to all 0s [valid]). If this instruction is executed while an x87 FPU
floating-point exception is pending, the exception is handled before the MOVQ2DQ instruction is executed.

In 64-bit mode, use of the REX.R prefix permits this instruction to access additional registers (XMM8-XMM15). 

Operation 

DEST[63:0] ← SRC[63:0];
DEST[127:64] ← 00000000000000000H;

Intel C/C++ Compiler Intrinsic Equivalent

MOVQ2DQ: __m128i _mm_movpi64_epi64 (__m64 a)

SIMD Floating-Point Exceptions 

None. 

Protected Mode Exceptions 

#NM If CR0.TS[bit 3] = 1. 

#UD If CR0.EM[bit 2] = 1. 

If CR4.OSFXSR[bit 9] = 0.

If CPUID.01H:EDX.SSE2[bit 26] = 0. 

If the LOCK prefix is used. 

#MF If there is a pending x87 FPU exception. 

Real-Address Mode Exceptions 

Same exceptions as in protected mode. 

Virtual-8086 Mode Exceptions

Same exceptions as in protected mode. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions 

Same exceptions as in protected mode.

MOVS/MOVSB/MOVSW/MOVSD/MOVSQ-Move Data from String to String


| Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description |
|---|---|---|---|---|---|
| A4 | MOVS m8, m8 | NP | Valid | Valid | For legacy mode, move byte from address DS:(E)SI to ES:(E)DI. For 64-bit mode move byte from address (R\|E)SI to (R\|E)DI. |
| A5 | MOVS m16, m16 | NP | Valid | Valid | For legacy mode, move word from address DS:(E)SI to ES:(E)DI. For 64-bit mode move word at address (R\|E)SI to (R\|E)DI. |
| A5 | MOVS m32, m32 | NP | Valid | Valid | For legacy mode, move dword from address DS:(E)SI to ES:(E)DI. For 64-bit mode move dword from address (R\|E)SI to (R\|E)DI. |
| REX.W + A5 | MOVS m64, m64 | NP | Valid | N.E. | Move qword from address (R\|E)SI to (R\|E)DI. |
| A4 | MOVSB | NP | Valid | Valid | For legacy mode, move byte from address DS:(E)SI to ES:(E)DI. For 64-bit mode move byte from address (R\|E)SI to (R\|E)DI. |
| A5 | MOVSW | NP | Valid | Valid | For legacy mode, move word from address DS:(E)SI to ES:(E)DI. For 64-bit mode move word at address (R\|E)SI to (R\|E)DI. |
| A5 | MOVSD | NP | Valid | Valid | For legacy mode, move dword from address DS:(E)SI to ES:(E)DI. For 64-bit mode move dword from address (R\|E)SI to (R\|E)DI. |
| REX.W + A5 | MOVSQ | NP | Valid | N.E. | Move qword from address (R\|E)SI to (R\|E)DI. |


Instruction Operand Encoding 


| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| NP | NA | NA | NA | NA |


Description 

Moves the byte, word, or doubleword specified with the second operand (source operand) to the location specified
with the first operand (destination operand). Both the source and destination operands are located in memory. The
address of the source operand is read from the DS:ESI or the DS:SI registers (depending on the address-size
attribute of the instruction, 32 or 16, respectively). The address of the destination operand is read from the ES:EDI or
the ES:DI registers (again depending on the address-size attribute of the instruction). The DS segment may be
overridden with a segment override prefix, but the ES segment cannot be overridden.

At the assembly-code level, two forms of this instruction are allowed: the "explicit-operands" form and the
"no-operands" form. The explicit-operands form (specified with the MOVS mnemonic) allows the source and destination
operands to be specified explicitly. Here, the source and destination operands should be symbols that indicate the
size and location of the source value and the destination, respectively. This explicit-operands form is provided to
allow documentation; however, note that the documentation provided by this form can be misleading. That is, the
source and destination operand symbols must specify the correct type (size) of the operands (bytes, words, or
doublewords), but they do not have to specify the correct location. The locations of the source and destination
operands are always specified by the DS:(E)SI and ES:(E)DI registers, which must be loaded correctly before the
move string instruction is executed.

The no-operands form provides "short forms" of the byte, word, and doubleword versions of the MOVS
instructions. Here also DS:(E)SI and ES:(E)DI are assumed to be the source and destination operands, respectively. The
size of the source and destination operands is selected with the mnemonic: MOVSB (byte move), MOVSW (word
move), or MOVSD (doubleword move).

After the move operation, the (E)SI and (E)DI registers are incremented or decremented automatically according
to the setting of the DF flag in the EFLAGS register. (If the DF flag is 0, the (E)SI and (E)DI registers are
incremented; if the DF flag is 1, the (E)SI and (E)DI registers are decremented.) The registers are incremented or
decremented by 1 for byte operations, by 2 for word operations, or by 4 for doubleword operations.

NOTE 

To improve performance, more recent processors support modifications to the processor's 
operation during the string store operations initiated with MOVS and MOVSB. See Section 7.3.9.3 
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1 for additional 
information on fast-string operation. 

The MOVS, MOVSB, MOVSW, and MOVSD instructions can be preceded by the REP prefix (see "REP/REPE/REPZ 
/REPNE/REPNZ—Repeat String Operation Prefix" for a description of the REP prefix) for block moves of ECX bytes, 
words, or doublewords. 
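
The index-register update and the REP repetition described above can be sketched as a small software model. This is an illustration only: `movs_state`, `movs_step` and `rep_movs` are hypothetical names, memory is modeled as one flat byte array, and segmentation, address-size wrapping and faults are ignored.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model state: flat byte memory plus the registers MOVS uses. */
typedef struct {
    uint8_t *mem;  /* flat memory; segment bases are ignored in this sketch */
    size_t   si;   /* models (E)SI, the source index */
    size_t   di;   /* models (E)DI, the destination index */
    size_t   cx;   /* models (E)CX, used only by the REP form */
    int      df;   /* direction flag: 0 = increment, 1 = decrement */
} movs_state;

/* One MOVS iteration: copy `size` bytes (1, 2 or 4), then step both
   index registers by `size` in the direction selected by DF. */
static void movs_step(movs_state *s, size_t size)
{
    memmove(s->mem + s->di, s->mem + s->si, size);
    if (s->df == 0) {
        s->si += size;
        s->di += size;
    } else {
        s->si -= size;
        s->di -= size;
    }
}

/* REP MOVS: repeat the step until the count register reaches zero. */
static void rep_movs(movs_state *s, size_t size)
{
    while (s->cx != 0) {
        movs_step(s, size);
        s->cx--;
    }
}
```

With DF = 1 the indices should start at the last element of each block, since every iteration moves them toward lower addresses.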

In 64-bit mode, the instruction's default address size is 64 bits; 32-bit address size is supported using the prefix
67H. The 64-bit addresses are specified by RSI and RDI; 32-bit addresses are specified by ESI and EDI. Use of the
REX.W prefix promotes doubleword operation to 64 bits. See the summary chart at the beginning of this section for
encoding data and limits.

Operation 

DEST ← SRC;

Non-64-bit Mode:

IF (Byte move)
    THEN IF DF = 0
        THEN
            (E)SI ← (E)SI + 1;
            (E)DI ← (E)DI + 1;
        ELSE
            (E)SI ← (E)SI - 1;
            (E)DI ← (E)DI - 1;
        FI;
    ELSE IF (Word move)
        THEN IF DF = 0
            THEN
                (E)SI ← (E)SI + 2;
                (E)DI ← (E)DI + 2;
            ELSE
                (E)SI ← (E)SI - 2;
                (E)DI ← (E)DI - 2;
        FI;
    ELSE IF (Doubleword move)
        THEN IF DF = 0
            THEN
                (E)SI ← (E)SI + 4;
                (E)DI ← (E)DI + 4;
            ELSE
                (E)SI ← (E)SI - 4;
                (E)DI ← (E)DI - 4;
        FI;
FI;

64-bit Mode:

IF (Byte move)
    THEN IF DF = 0
        THEN
            (R|E)SI ← (R|E)SI + 1;
            (R|E)DI ← (R|E)DI + 1;
        ELSE
            (R|E)SI ← (R|E)SI - 1;
            (R|E)DI ← (R|E)DI - 1;
        FI;
    ELSE IF (Word move)
        THEN IF DF = 0
            THEN
                (R|E)SI ← (R|E)SI + 2;
                (R|E)DI ← (R|E)DI + 2;
            ELSE
                (R|E)SI ← (R|E)SI - 2;
                (R|E)DI ← (R|E)DI - 2;
        FI;
    ELSE IF (Doubleword move)
        THEN IF DF = 0
            THEN
                (R|E)SI ← (R|E)SI + 4;
                (R|E)DI ← (R|E)DI + 4;
            ELSE
                (R|E)SI ← (R|E)SI - 4;
                (R|E)DI ← (R|E)DI - 4;
        FI;
    ELSE IF (Quadword move)
        THEN IF DF = 0
            THEN
                (R|E)SI ← (R|E)SI + 8;
                (R|E)DI ← (R|E)DI + 8;
            ELSE
                (R|E)SI ← (R|E)SI - 8;
                (R|E)DI ← (R|E)DI - 8;
        FI;
FI;

Flags Affected 

None 



Protected Mode Exceptions 

#GP(0) If the destination is located in a non-writable segment.

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register contains a NULL segment selector.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
current privilege level is 3.

#UD If the LOCK prefix is used.


Real-Address Mode Exceptions 


#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If the LOCK prefix is used.

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made. 

#UD If the LOCK prefix is used. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 

#GP(0) If the memory address is in a non-canonical form. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used.

MOVSD—Move or Merge Scalar Double-Precision Floating-Point Value

| Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description |
|---|---|---|---|---|
| F2 0F 10 /r MOVSD xmm1, xmm2 | RM | V/V | SSE2 | Move scalar double-precision floating-point value from xmm2 to xmm1 register. |
| F2 0F 10 /r MOVSD xmm1, m64 | RM | V/V | SSE2 | Load scalar double-precision floating-point value from m64 to xmm1 register. |
| F2 0F 11 /r MOVSD xmm1/m64, xmm2 | MR | V/V | SSE2 | Move scalar double-precision floating-point value from xmm2 register to xmm1/m64. |
| VEX.NDS.LIG.F2.0F.WIG 10 /r VMOVSD xmm1, xmm2, xmm3 | RVM | V/V | AVX | Merge scalar double-precision floating-point value from xmm2 and xmm3 to xmm1 register. |
| VEX.LIG.F2.0F.WIG 10 /r VMOVSD xmm1, m64 | XM | V/V | AVX | Load scalar double-precision floating-point value from m64 to xmm1 register. |
| VEX.NDS.LIG.F2.0F.WIG 11 /r VMOVSD xmm1, xmm2, xmm3 | MVR | V/V | AVX | Merge scalar double-precision floating-point value from xmm2 and xmm3 registers to xmm1. |
| VEX.LIG.F2.0F.WIG 11 /r VMOVSD m64, xmm1 | MR | V/V | AVX | Store scalar double-precision floating-point value from xmm1 register to m64. |
| EVEX.NDS.LIG.F2.0F.W1 10 /r VMOVSD xmm1 {k1}{z}, xmm2, xmm3 | RVM | V/V | AVX512F | Merge scalar double-precision floating-point value from xmm2 and xmm3 registers to xmm1 under writemask k1. |
| EVEX.LIG.F2.0F.W1 10 /r VMOVSD xmm1 {k1}{z}, m64 | T1S-RM | V/V | AVX512F | Load scalar double-precision floating-point value from m64 to xmm1 register under writemask k1. |
| EVEX.NDS.LIG.F2.0F.W1 11 /r VMOVSD xmm1 {k1}{z}, xmm2, xmm3 | MVR | V/V | AVX512F | Merge scalar double-precision floating-point value from xmm2 and xmm3 registers to xmm1 under writemask k1. |
| EVEX.LIG.F2.0F.W1 11 /r VMOVSD m64 {k1}, xmm1 | T1S-MR | V/V | AVX512F | Store scalar double-precision floating-point value from xmm1 register to m64 under writemask k1. |


Instruction Operand Encoding 


| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA |
| RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA |
| MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA |
| XM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA |
| MVR | ModRM:r/m (w) | vvvv (r) | ModRM:reg (r) | NA |
| T1S-RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA |
| T1S-MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA |

Description

Moves a scalar double-precision floating-point value from the source operand (second operand) to the destination 
operand (first operand). The source and destination operands can be XMM registers or 64-bit memory locations. 
This instruction can be used to move a double-precision floating-point value to and from the low quadword of an 
XMM register and a 64-bit memory location, or to move a double-precision floating-point value between the low 
quadwords of two XMM registers. The instruction cannot be used to transfer data between memory locations. 

Legacy version: When the source and destination operands are XMM registers, bits (MAX_VL-1:64) of the destination
operand remain unchanged. When the source operand is a memory location and the destination operand is an XMM
register, the quadword at bits 127:64 of the destination operand is cleared to all 0s, and bits (MAX_VL-1:128) of the
destination operand remain unchanged.

VEX and EVEX encoded register-register syntax: Moves a scalar double-precision floating-point value from the
second source operand (the third operand) to the low quadword element of the destination operand (the first
operand). Bits 127:64 of the destination operand are copied from the first source operand (the second operand).
Bits (MAX_VL-1:128) of the corresponding destination register are zeroed.

VEX and EVEX encoded memory load syntax: When the source operand is a memory location and the destination
operand is an XMM register, bits (MAX_VL-1:64) of the destination operand are cleared to all 0s.

EVEX encoded versions: The low quadword of the destination is updated according to the writemask.

Note: For VMOVSD (memory store and load forms), VEX.vvvv and EVEX.vvvv are reserved and must be 1111b,
otherwise instruction will #UD.
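
The register-register merge and the writemask behavior described above can be sketched as a software model. This is an illustration under stated assumptions: `xmm_model` and `vmovsd_reg` are hypothetical names, values are carried as raw 64-bit patterns rather than as `double`, and only the low 128 bits are modeled (the upper bits are zeroed by the VEX and EVEX forms).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the low 128 bits of a vector register. */
typedef struct {
    uint64_t lo;  /* bits 63:0   */
    uint64_t hi;  /* bits 127:64 */
} xmm_model;

/* Model of VMOVSD xmm1 {k1}{z}, xmm2, xmm3.
   dest_old: previous destination value (used only for merging-masking)
   src1:     first source (xmm2), supplies bits 127:64
   src2:     second source (xmm3), supplies the scalar in bits 63:0
   k0:       bit 0 of the writemask; pass true for the unmasked VEX form
   zeroing:  true selects zeroing-masking, false selects merging-masking */
static xmm_model vmovsd_reg(xmm_model dest_old, xmm_model src1,
                            xmm_model src2, bool k0, bool zeroing)
{
    xmm_model dest;
    if (k0) {
        dest.lo = src2.lo;       /* scalar comes from the second source */
    } else if (zeroing) {
        dest.lo = 0;             /* masked off with zeroing-masking */
    } else {
        dest.lo = dest_old.lo;   /* masked off with merging-masking */
    }
    dest.hi = src1.hi;           /* upper quadword comes from the first source */
    return dest;
}
```

The mask affects only the low quadword; bits 127:64 are taken from the first source in every case.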

Operation 

VMOVSD (EVEX.NDS.LIG.F2.0F 10 /r: VMOVSD xmm1, m64 with support for 32 registers)

IF k1[0] or *no writemask*
    THEN DEST[63:0] ← SRC[63:0]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                THEN DEST[63:0] ← 0
        FI;
FI;
DEST[511:64] ← 0

VMOVSD (EVEX.NDS.LIG.F2.0F 11 /r: VMOVSD m64, xmm1 with support for 32 registers)

IF k1[0] or *no writemask*
    THEN DEST[63:0] ← SRC[63:0]
    ELSE *DEST[63:0] remains unchanged* ; merging-masking
FI;

VMOVSD (EVEX.NDS.LIG.F2.0F 11 /r: VMOVSD xmm1, xmm2, xmm3)

IF k1[0] or *no writemask*
    THEN DEST[63:0] ← SRC2[63:0]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                THEN DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAX_VL-1:128] ← 0

MOVSD (128-bit Legacy SSE version: MOVSD XMM1, XMM2)

DEST[63:0] ← SRC[63:0]
DEST[MAX_VL-1:64] (Unmodified)

VMOVSD (VEX.NDS.128.F2.0F 11 /r: VMOVSD xmm1, xmm2, xmm3)

DEST[63:0] ← SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[MAX_VL-1:128] ← 0

VMOVSD (VEX.NDS.128.F2.0F 10 /r: VMOVSD xmm1, xmm2, xmm3)

DEST[63:0] ← SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[MAX_VL-1:128] ← 0

VMOVSD (VEX.NDS.128.F2.0F 10 /r: VMOVSD xmm1, m64)

DEST[63:0] ← SRC[63:0]
DEST[MAX_VL-1:64] ← 0

MOVSD/VMOVSD (128-bit versions: MOVSD m64, xmm1 or VMOVSD m64, xmm1)

DEST[63:0] ← SRC[63:0]

MOVSD (128-bit Legacy SSE version: MOVSD XMM1, m64)

DEST[63:0] ← SRC[63:0]
DEST[127:64] ← 0
DEST[MAX_VL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent 

VMOVSD __m128d _mm_mask_load_sd(__m128d s, __mmask8 k, double * p);

VMOVSD __m128d _mm_maskz_load_sd(__mmask8 k, double * p);

VMOVSD __m128d _mm_mask_move_sd(__m128d sh, __mmask8 k, __m128d sl, __m128d a);

VMOVSD __m128d _mm_maskz_move_sd(__mmask8 k, __m128d s, __m128d a);

VMOVSD void _mm_mask_store_sd(double * p, __mmask8 k, __m128d s);

MOVSD __m128d _mm_load_sd (double *p)

MOVSD void _mm_store_sd (double *p, __m128d a)

MOVSD __m128d _mm_move_sd (__m128d a, __m128d b)

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 5; additionally
#UD If VEX.vvvv != 1111B.

EVEX-encoded instruction, see Exceptions Type E10.

MOVSHDUP—Replicate Single FP Values


| Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description |
|---|---|---|---|---|
| F3 0F 16 /r MOVSHDUP xmm1, xmm2/m128 | RM | V/V | SSE3 | Move odd index single-precision floating-point values from xmm2/mem and duplicate each element into xmm1. |
| VEX.128.F3.0F.WIG 16 /r VMOVSHDUP xmm1, xmm2/m128 | RM | V/V | AVX | Move odd index single-precision floating-point values from xmm2/mem and duplicate each element into xmm1. |
| VEX.256.F3.0F.WIG 16 /r VMOVSHDUP ymm1, ymm2/m256 | RM | V/V | AVX | Move odd index single-precision floating-point values from ymm2/mem and duplicate each element into ymm1. |
| EVEX.128.F3.0F.W0 16 /r VMOVSHDUP xmm1 {k1}{z}, xmm2/m128 | FVM | V/V | AVX512VL AVX512F | Move odd index single-precision floating-point values from xmm2/m128 and duplicate each element into xmm1 under writemask. |
| EVEX.256.F3.0F.W0 16 /r VMOVSHDUP ymm1 {k1}{z}, ymm2/m256 | FVM | V/V | AVX512VL AVX512F | Move odd index single-precision floating-point values from ymm2/m256 and duplicate each element into ymm1 under writemask. |
| EVEX.512.F3.0F.W0 16 /r VMOVSHDUP zmm1 {k1}{z}, zmm2/m512 | FVM | V/V | AVX512F | Move odd index single-precision floating-point values from zmm2/m512 and duplicate each element into zmm1 under writemask. |


Instruction Operand Encoding 


| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA |
| FVM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA |


Description 

Duplicates odd-indexed single-precision floating-point values from the source operand (the second operand) to the
adjacent element pair in the destination operand (the first operand). See Figure 4-3. The source operand is an
XMM, YMM or ZMM register or a 128, 256 or 512-bit memory location and the destination operand is an XMM, YMM
or ZMM register.

128-bit Legacy SSE version: Bits (MAX_VL-1:128) of the corresponding destination register remain unchanged.

VEX.128 encoded version: Bits (MAX_VL-1:128) of the destination register are zeroed.

VEX.256 encoded version: Bits (MAX_VL-1:256) of the destination register are zeroed.

EVEX encoded version: The destination operand is updated at 32-bit granularity according to the writemask.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise instructions will #UD.



Figure 4-3. MOVSHDUP Operation 


Operation 

VMOVSHDUP (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
TMP_SRC[31:0] ← SRC[63:32]
TMP_SRC[63:32] ← SRC[63:32]
TMP_SRC[95:64] ← SRC[127:96]
TMP_SRC[127:96] ← SRC[127:96]
IF VL >= 256
    TMP_SRC[159:128] ← SRC[191:160]
    TMP_SRC[191:160] ← SRC[191:160]
    TMP_SRC[223:192] ← SRC[255:224]
    TMP_SRC[255:224] ← SRC[255:224]
FI;
IF VL >= 512
    TMP_SRC[287:256] ← SRC[319:288]
    TMP_SRC[319:288] ← SRC[319:288]
    TMP_SRC[351:320] ← SRC[383:352]
    TMP_SRC[383:352] ← SRC[383:352]
    TMP_SRC[415:384] ← SRC[447:416]
    TMP_SRC[447:416] ← SRC[447:416]
    TMP_SRC[479:448] ← SRC[511:480]
    TMP_SRC[511:480] ← SRC[511:480]
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_SRC[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VMOVSHDUP (VEX.256 encoded version)
DEST[31:0] ← SRC[63:32]
DEST[63:32] ← SRC[63:32]
DEST[95:64] ← SRC[127:96]
DEST[127:96] ← SRC[127:96]
DEST[159:128] ← SRC[191:160]
DEST[191:160] ← SRC[191:160]
DEST[223:192] ← SRC[255:224]
DEST[255:224] ← SRC[255:224]
DEST[MAX_VL-1:256] ← 0

VMOVSHDUP (VEX.128 encoded version)
DEST[31:0] ← SRC[63:32]
DEST[63:32] ← SRC[63:32]
DEST[95:64] ← SRC[127:96]
DEST[127:96] ← SRC[127:96]
DEST[MAX_VL-1:128] ← 0

MOVSHDUP (128-bit Legacy SSE version)
DEST[31:0] ← SRC[63:32]
DEST[63:32] ← SRC[63:32]
DEST[95:64] ← SRC[127:96]
DEST[127:96] ← SRC[127:96]
DEST[MAX_VL-1:128] (Unmodified)
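
The unmasked element movement in the pseudocode above can be sketched in portable C. This is only a reference model under stated assumptions, not the instruction or its intrinsic: the name movshdup_model is hypothetical, the vector is represented as a plain float array, and n is assumed to be the even element count (4, 8 or 16). Upper-bit zeroing and writemasking are not modeled.

```c
#include <stddef.h>

/* Hypothetical reference model of the MOVSHDUP data movement:
 * the odd-indexed float of each pair in src is written to both
 * elements of the same pair in dst. n must be even. */
static void movshdup_model(float *dst, const float *src, size_t n)
{
    for (size_t pair = 0; pair + 1 < n; pair += 2) {
        float odd = src[pair + 1];   /* read first, so dst may alias src */
        dst[pair]     = odd;
        dst[pair + 1] = odd;
    }
}
```

For a source of {1, 2, 3, 4} (element 0 first), the model yields {2, 2, 4, 4}, matching DEST[31:0] ← SRC[63:32] and DEST[95:64] ← SRC[127:96].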

Intel C/C++ Compiler Intrinsic Equivalent 

VMOVSHDUP __m512 _mm512_movehdup_ps( __m512 a);
VMOVSHDUP __m512 _mm512_mask_movehdup_ps(__m512 s, __mmask16 k, __m512 a);
VMOVSHDUP __m512 _mm512_maskz_movehdup_ps( __mmask16 k, __m512 a);
VMOVSHDUP __m256 _mm256_mask_movehdup_ps(__m256 s, __mmask8 k, __m256 a);
VMOVSHDUP __m256 _mm256_maskz_movehdup_ps( __mmask8 k, __m256 a);
VMOVSHDUP __m128 _mm_mask_movehdup_ps(__m128 s, __mmask8 k, __m128 a);
VMOVSHDUP __m128 _mm_maskz_movehdup_ps( __mmask8 k, __m128 a);
VMOVSHDUP __m256 _mm256_movehdup_ps (__m256 a);
VMOVSHDUP __m128 _mm_movehdup_ps (__m128 a);

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4;

EVEX-encoded instruction, see Exceptions Type E4NF.nb.

#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.

MOVSLDUP—Replicate Single FP Values


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F 12 /r MOVSLDUP xmm1, xmm2/m128 | RM | V/V | SSE3 | Move even index single-precision floating-point values from xmm2/mem and duplicate each element into xmm1.
VEX.128.F3.0F.WIG 12 /r VMOVSLDUP xmm1, xmm2/m128 | RM | V/V | AVX | Move even index single-precision floating-point values from xmm2/mem and duplicate each element into xmm1.
VEX.256.F3.0F.WIG 12 /r VMOVSLDUP ymm1, ymm2/m256 | RM | V/V | AVX | Move even index single-precision floating-point values from ymm2/mem and duplicate each element into ymm1.
EVEX.128.F3.0F.W0 12 /r VMOVSLDUP xmm1 {k1}{z}, xmm2/m128 | FVM | V/V | AVX512VL AVX512F | Move even index single-precision floating-point values from xmm2/m128 and duplicate each element into xmm1 under writemask.
EVEX.256.F3.0F.W0 12 /r VMOVSLDUP ymm1 {k1}{z}, ymm2/m256 | FVM | V/V | AVX512VL AVX512F | Move even index single-precision floating-point values from ymm2/m256 and duplicate each element into ymm1 under writemask.
EVEX.512.F3.0F.W0 12 /r VMOVSLDUP zmm1 {k1}{z}, zmm2/m512 | FVM | V/V | AVX512F | Move even index single-precision floating-point values from zmm2/m512 and duplicate each element into zmm1 under writemask.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
FVM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA


Description 

Duplicates even-indexed single-precision floating-point values from the source operand (the second operand) to the
adjacent element pair in the destination operand (the first operand). See Figure 4-4. The source operand is an
XMM, YMM or ZMM register or a 128, 256 or 512-bit memory location and the destination operand is an XMM, YMM
or ZMM register.

128-bit Legacy SSE version: Bits (MAX_VL-1:128) of the corresponding destination register remain unchanged.

VEX.128 encoded version: Bits (MAX_VL-1:128) of the destination register are zeroed.

VEX.256 encoded version: Bits (MAX_VL-1:256) of the destination register are zeroed.

EVEX encoded version: The destination operand is updated at 32-bit granularity according to the writemask.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise instructions will #UD.



Figure 4-4. MOVSLDUP Operation

Operation

VMOVSLDUP (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
TMP_SRC[31:0] ← SRC[31:0]
TMP_SRC[63:32] ← SRC[31:0]
TMP_SRC[95:64] ← SRC[95:64]
TMP_SRC[127:96] ← SRC[95:64]
IF VL >= 256
    TMP_SRC[159:128] ← SRC[159:128]
    TMP_SRC[191:160] ← SRC[159:128]
    TMP_SRC[223:192] ← SRC[223:192]
    TMP_SRC[255:224] ← SRC[223:192]
FI;
IF VL >= 512
    TMP_SRC[287:256] ← SRC[287:256]
    TMP_SRC[319:288] ← SRC[287:256]
    TMP_SRC[351:320] ← SRC[351:320]
    TMP_SRC[383:352] ← SRC[351:320]
    TMP_SRC[415:384] ← SRC[415:384]
    TMP_SRC[447:416] ← SRC[415:384]
    TMP_SRC[479:448] ← SRC[479:448]
    TMP_SRC[511:480] ← SRC[479:448]
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_SRC[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VMOVSLDUP (VEX.256 encoded version)
DEST[31:0] ← SRC[31:0]
DEST[63:32] ← SRC[31:0]
DEST[95:64] ← SRC[95:64]
DEST[127:96] ← SRC[95:64]
DEST[159:128] ← SRC[159:128]
DEST[191:160] ← SRC[159:128]
DEST[223:192] ← SRC[223:192]
DEST[255:224] ← SRC[223:192]
DEST[MAX_VL-1:256] ← 0

VMOVSLDUP (VEX.128 encoded version)
DEST[31:0] ← SRC[31:0]
DEST[63:32] ← SRC[31:0]
DEST[95:64] ← SRC[95:64]
DEST[127:96] ← SRC[95:64]
DEST[MAX_VL-1:128] ← 0

MOVSLDUP (128-bit Legacy SSE version)
DEST[31:0] ← SRC[31:0]
DEST[63:32] ← SRC[31:0]
DEST[95:64] ← SRC[95:64]
DEST[127:96] ← SRC[95:64]
DEST[MAX_VL-1:128] (Unmodified)
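
The EVEX pseudocode above first builds TMP_SRC and then applies the writemask per 32-bit element. A minimal C sketch of those two phases follows, assuming a hypothetical helper name (vmovsldup_mask_model), a float-array view of the registers, KL of at most 16, and an integer flag standing in for the {z} bit. It does not model the zeroing of bits above VL.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of EVEX-encoded VMOVSLDUP.
 * kl      : KL from the pseudocode (4, 8 or 16 elements)
 * k       : writemask k1, bit j controls element j
 * zeroing : nonzero for zeroing-masking, zero for merging-masking */
static void vmovsldup_mask_model(float *dst, const float *src, size_t kl,
                                 uint16_t k, int zeroing)
{
    float tmp[16];                          /* TMP_SRC, sized for VL = 512 */

    for (size_t j = 0; j < kl; j++)
        tmp[j] = src[j & ~(size_t)1];       /* even element of each pair */

    for (size_t j = 0; j < kl; j++) {
        if ((k >> j) & 1u)
            dst[j] = tmp[j];                /* mask bit set: take TMP_SRC */
        else if (zeroing)
            dst[j] = 0.0f;                  /* zeroing-masking */
        /* merging-masking: dst[j] remains unchanged */
    }
}
```

Building the temporary before masking mirrors the manual's ordering and keeps the model correct when dst and src refer to the same storage.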

Intel C/C++ Compiler Intrinsic Equivalent 

VMOVSLDUP __m512 _mm512_moveldup_ps( __m512 a);
VMOVSLDUP __m512 _mm512_mask_moveldup_ps(__m512 s, __mmask16 k, __m512 a);
VMOVSLDUP __m512 _mm512_maskz_moveldup_ps( __mmask16 k, __m512 a);
VMOVSLDUP __m256 _mm256_mask_moveldup_ps(__m256 s, __mmask8 k, __m256 a);
VMOVSLDUP __m256 _mm256_maskz_moveldup_ps( __mmask8 k, __m256 a);
VMOVSLDUP __m128 _mm_mask_moveldup_ps(__m128 s, __mmask8 k, __m128 a);
VMOVSLDUP __m128 _mm_maskz_moveldup_ps( __mmask8 k, __m128 a);
VMOVSLDUP __m256 _mm256_moveldup_ps (__m256 a);
VMOVSLDUP __m128 _mm_moveldup_ps (__m128 a);

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4;

EVEX-encoded instruction, see Exceptions Type E4NF.nb.

#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.

MOVSS—Move or Merge Scalar Single-Precision Floating-Point Value


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F 10 /r MOVSS xmm1, xmm2 | RM | V/V | SSE | Merge scalar single-precision floating-point value from xmm2 to xmm1 register.
F3 0F 10 /r MOVSS xmm1, m32 | RM | V/V | SSE | Load scalar single-precision floating-point value from m32 to xmm1 register.
VEX.NDS.LIG.F3.0F.WIG 10 /r VMOVSS xmm1, xmm2, xmm3 | RVM | V/V | AVX | Merge scalar single-precision floating-point value from xmm2 and xmm3 to xmm1 register.
VEX.LIG.F3.0F.WIG 10 /r VMOVSS xmm1, m32 | XM | V/V | AVX | Load scalar single-precision floating-point value from m32 to xmm1 register.
F3 0F 11 /r MOVSS xmm2/m32, xmm1 | MR | V/V | SSE | Move scalar single-precision floating-point value from xmm1 register to xmm2/m32.
VEX.NDS.LIG.F3.0F.WIG 11 /r VMOVSS xmm1, xmm2, xmm3 | MVR | V/V | AVX | Move scalar single-precision floating-point value from xmm2 and xmm3 to xmm1 register.
VEX.LIG.F3.0F.WIG 11 /r VMOVSS m32, xmm1 | MR | V/V | AVX | Move scalar single-precision floating-point value from xmm1 register to m32.
EVEX.NDS.LIG.F3.0F.W0 10 /r VMOVSS xmm1 {k1}{z}, xmm2, xmm3 | RVM | V/V | AVX512F | Move scalar single-precision floating-point value from xmm2 and xmm3 to xmm1 register under writemask k1.
EVEX.LIG.F3.0F.W0 10 /r VMOVSS xmm1 {k1}{z}, m32 | T1S-RM | V/V | AVX512F | Move scalar single-precision floating-point values from m32 to xmm1 under writemask k1.
EVEX.NDS.LIG.F3.0F.W0 11 /r VMOVSS xmm1 {k1}{z}, xmm2, xmm3 | MVR | V/V | AVX512F | Move scalar single-precision floating-point value from xmm2 and xmm3 to xmm1 register under writemask k1.
EVEX.LIG.F3.0F.W0 11 /r VMOVSS m32 {k1}, xmm1 | T1S-MR | V/V | AVX512F | Move scalar single-precision floating-point values from xmm1 to m32 under writemask k1.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
XM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
MVR | ModRM:r/m (w) | vvvv (r) | ModRM:reg (r) | NA
T1S-RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
T1S-MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA

Description

Moves a scalar single-precision floating-point value from the source operand (second operand) to the destination
operand (first operand). The source and destination operands can be XMM registers or 32-bit memory locations.
This instruction can be used to move a single-precision floating-point value to and from the low doubleword of an
XMM register and a 32-bit memory location, or to move a single-precision floating-point value between the low
doublewords of two XMM registers. The instruction cannot be used to transfer data between memory locations.

Legacy version: When the source and destination operands are XMM registers, bits (MAX_VL-1:32) of the
corresponding destination register are unmodified. When the source operand is a memory location and the destination
operand is an XMM register, bits (127:32) of the destination operand are cleared to all 0s and bits (MAX_VL-1:128)
of the destination operand remain unchanged.

VEX and EVEX encoded register-register syntax: Moves a scalar single-precision floating-point value from the
second source operand (the third operand) to the low doubleword element of the destination operand (the first
operand). Bits 127:32 of the destination operand are copied from the first source operand (the second operand).
Bits (MAX_VL-1:128) of the corresponding destination register are zeroed.

VEX and EVEX encoded memory load syntax: When the source operand is a memory location and the destination
operand is an XMM register, bits (MAX_VL-1:32) of the destination operand are cleared to all 0s.

EVEX encoded versions: The low doubleword of the destination is updated according to the writemask.

Note: For the memory store form "VMOVSS m32, xmm1", VEX.vvvv is reserved and must be 1111b; otherwise the
instruction will #UD. For the memory store form "VMOVSS m32 {k1}, xmm1", EVEX.vvvv is reserved
and must be 1111b; otherwise the instruction will #UD.

Software should ensure VMOVSS is encoded with VEX.L=0. Encoding VMOVSS with VEX.L=1 may encounter 
unpredictable behavior across different processor generations. 

Operation 

VMOVSS (EVEX.LIG.F3.0F.W0 10 /r when the source operand is memory and the destination is an XMM register)
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← SRC[31:0]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE DEST[31:0] ← 0 ; zeroing-masking
        FI;
FI;
DEST[511:32] ← 0

VMOVSS (EVEX.LIG.F3.0F.W0 11 /r when the source operand is an XMM register and the destination is memory)
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← SRC[31:0]
    ELSE *DEST[31:0] remains unchanged* ; merging-masking
FI;

VMOVSS (EVEX.NDS.LIG.F3.0F.W0 10/11 /r where the source and destination are XMM registers)
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← SRC2[31:0]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE DEST[31:0] ← 0 ; zeroing-masking
        FI;
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAX_VL-1:128] ← 0

MOVSS (Legacy SSE version when the source and destination operands are both XMM registers)
DEST[31:0] ← SRC[31:0]
DEST[MAX_VL-1:32] (Unmodified)

VMOVSS (VEX.NDS.128.F3.0F 11 /r where the destination is an XMM register)
DEST[31:0] ← SRC2[31:0]
DEST[127:32] ← SRC1[127:32]
DEST[MAX_VL-1:128] ← 0

VMOVSS (VEX.NDS.128.F3.0F 10 /r where the source and destination are XMM registers)
DEST[31:0] ← SRC2[31:0]
DEST[127:32] ← SRC1[127:32]
DEST[MAX_VL-1:128] ← 0

VMOVSS (VEX.NDS.128.F3.0F 10 /r when the source operand is memory and the destination is an XMM register)
DEST[31:0] ← SRC[31:0]
DEST[MAX_VL-1:32] ← 0

MOVSS/VMOVSS (when the source operand is an XMM register and the destination is memory)
DEST[31:0] ← SRC[31:0]

MOVSS (Legacy SSE version when the source operand is memory and the destination is an XMM register)
DEST[31:0] ← SRC[31:0]
DEST[127:32] ← 0
DEST[MAX_VL-1:128] (Unmodified)
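
The three unmasked behaviors described above (legacy register merge, legacy load with clearing, and the VEX three-operand merge) differ only in what happens to bits 127:32. The sketch below contrasts them in portable C. The type xmm_model and the function names are hypothetical stand-ins for the low 128 bits of an XMM register, and bits above 127 are not modeled.

```c
#include <string.h>

/* Hypothetical stand-in for the low 128 bits of an XMM register;
 * f[0] holds bits 31:0. */
typedef struct { float f[4]; } xmm_model;

/* Legacy MOVSS xmm1, xmm2: only bits 31:0 of the destination change. */
static void movss_reg_legacy(xmm_model *dst, const xmm_model *src)
{
    dst->f[0] = src->f[0];
}

/* Legacy MOVSS xmm1, m32: bits 127:32 of the destination are cleared. */
static void movss_load_legacy(xmm_model *dst, const float *mem)
{
    memset(dst, 0, sizeof *dst);
    dst->f[0] = *mem;
}

/* VMOVSS xmm1, xmm2, xmm3: bits 31:0 come from the second source (src2),
 * bits 127:32 come from the first source (src1). */
static xmm_model vmovss_reg(xmm_model src1, xmm_model src2)
{
    xmm_model result = src1;
    result.f[0] = src2.f[0];
    return result;
}
```

Returning the VEX result by value reflects that the three-operand form does not read its destination, whereas the legacy register form takes the destination by pointer because it preserves part of it.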

Intel C/C++ Compiler Intrinsic Equivalent

VMOVSS __m128 _mm_mask_load_ss(__m128 s, __mmask8 k, float * p);
VMOVSS __m128 _mm_maskz_load_ss( __mmask8 k, float * p);
VMOVSS __m128 _mm_mask_move_ss(__m128 sh, __mmask8 k, __m128 sl, __m128 a);
VMOVSS __m128 _mm_maskz_move_ss( __mmask8 k, __m128 s, __m128 a);
VMOVSS void _mm_mask_store_ss(float * p, __mmask8 k, __m128 a);
MOVSS __m128 _mm_load_ss(float * p)
MOVSS void _mm_store_ss(float * p, __m128 a)
MOVSS __m128 _mm_move_ss(__m128 a, __m128 b)

SIMD Floating-Point Exceptions 

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 5; additionally
#UD If VEX.vvvv != 1111B.

EVEX-encoded instruction, see Exceptions Type E10.

MOVSX/MOVSXD—Move with Sign-Extension


Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
0F BE /r | MOVSX r16, r/m8 | RM | Valid | Valid | Move byte to word with sign-extension.
0F BE /r | MOVSX r32, r/m8 | RM | Valid | Valid | Move byte to doubleword with sign-extension.
REX.W + 0F BE /r | MOVSX r64, r/m8* | RM | Valid | N.E. | Move byte to quadword with sign-extension.
0F BF /r | MOVSX r32, r/m16 | RM | Valid | Valid | Move word to doubleword, with sign-extension.
REX.W + 0F BF /r | MOVSX r64, r/m16 | RM | Valid | N.E. | Move word to quadword with sign-extension.
REX.W** + 63 /r | MOVSXD r64, r/m32 | RM | Valid | N.E. | Move doubleword to quadword with sign-extension.

NOTES:

* In 64-bit mode, r/m8 cannot be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.

** The use of MOVSXD without REX.W in 64-bit mode is discouraged. Regular MOV should be used instead of using MOVSXD without
REX.W.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA


Description 

Copies the contents of the source operand (register or memory location) to the destination operand (register) and
sign extends the value to 16 or 32 bits (see Figure 7-6 in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 1). The size of the converted value depends on the operand-size attribute.

In 64-bit mode, the instruction's default operation size is 32 bits. Use of the REX.R prefix permits access to
additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the
beginning of this section for encoding data and limits.

Operation 

DEST ← SignExtend(SRC);
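
SignExtend replicates the most significant bit of the source into every added upper bit of the destination. As an illustration only, the following C helpers (hypothetical names, one per source width) compute that result with the XOR-and-subtract idiom, which avoids relying on implementation-defined narrowing conversions. The sketch assumes int is at least 32 bits wide.

```c
#include <stdint.h>

/* Models MOVSX r32, r/m8: flip the sign bit, then subtract its weight. */
static int32_t movsx_8_to_32(uint8_t src)
{
    return (int32_t)(src ^ 0x80) - 0x80;
}

/* Models MOVSX r32, r/m16. */
static int32_t movsx_16_to_32(uint16_t src)
{
    return (int32_t)(src ^ 0x8000) - 0x8000;
}

/* Models MOVSXD r64, r/m32 (with REX.W). */
static int64_t movsxd_32_to_64(uint32_t src)
{
    return (int64_t)(src ^ UINT32_C(0x80000000)) - INT64_C(0x80000000);
}
```

For example, a source byte of 0x80 extends to -128 (0xFFFFFF80 as a doubleword), while 0x7F extends to 127.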

Flags Affected 

None. 


Protected Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
       If the DS, ES, FS, or GS register contains a NULL segment selector.
#SS(0) If a memory operand effective address is outside the SS segment limit.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
       current privilege level is 3.
#UD If the LOCK prefix is used.

Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS If a memory operand effective address is outside the SS segment limit.
#UD If the LOCK prefix is used.

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS(0) If a memory operand effective address is outside the SS segment limit.
#PF(fault-code) If a page fault occurs.
#UD If the LOCK prefix is used.

Compatibility Mode Exceptions

Same exceptions as in protected mode.

64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
       current privilege level is 3.
#UD If the LOCK prefix is used.

MOVUPD—Move Unaligned Packed Double-Precision Floating-Point Values


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 10 /r MOVUPD xmm1, xmm2/m128 | RM | V/V | SSE2 | Move unaligned packed double-precision floating-point from xmm2/mem to xmm1.
66 0F 11 /r MOVUPD xmm2/m128, xmm1 | MR | V/V | SSE2 | Move unaligned packed double-precision floating-point from xmm1 to xmm2/mem.
VEX.128.66.0F.WIG 10 /r VMOVUPD xmm1, xmm2/m128 | RM | V/V | AVX | Move unaligned packed double-precision floating-point from xmm2/mem to xmm1.
VEX.128.66.0F.WIG 11 /r VMOVUPD xmm2/m128, xmm1 | MR | V/V | AVX | Move unaligned packed double-precision floating-point from xmm1 to xmm2/mem.
VEX.256.66.0F.WIG 10 /r VMOVUPD ymm1, ymm2/m256 | RM | V/V | AVX | Move unaligned packed double-precision floating-point from ymm2/mem to ymm1.
VEX.256.66.0F.WIG 11 /r VMOVUPD ymm2/m256, ymm1 | MR | V/V | AVX | Move unaligned packed double-precision floating-point from ymm1 to ymm2/mem.
EVEX.128.66.0F.W1 10 /r VMOVUPD xmm1 {k1}{z}, xmm2/m128 | FVM-RM | V/V | AVX512VL AVX512F | Move unaligned packed double-precision floating-point from xmm2/m128 to xmm1 using writemask k1.
EVEX.128.66.0F.W1 11 /r VMOVUPD xmm2/m128 {k1}{z}, xmm1 | FVM-MR | V/V | AVX512VL AVX512F | Move unaligned packed double-precision floating-point from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.66.0F.W1 10 /r VMOVUPD ymm1 {k1}{z}, ymm2/m256 | FVM-RM | V/V | AVX512VL AVX512F | Move unaligned packed double-precision floating-point from ymm2/m256 to ymm1 using writemask k1.
EVEX.256.66.0F.W1 11 /r VMOVUPD ymm2/m256 {k1}{z}, ymm1 | FVM-MR | V/V | AVX512VL AVX512F | Move unaligned packed double-precision floating-point from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.66.0F.W1 10 /r VMOVUPD zmm1 {k1}{z}, zmm2/m512 | FVM-RM | V/V | AVX512F | Move unaligned packed double-precision floating-point values from zmm2/m512 to zmm1 using writemask k1.
EVEX.512.66.0F.W1 11 /r VMOVUPD zmm2/m512 {k1}{z}, zmm1 | FVM-MR | V/V | AVX512F | Move unaligned packed double-precision floating-point values from zmm1 to zmm2/m512 using writemask k1.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
FVM-RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
FVM-MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA


Description 

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise instructions will #UD.

EVEX.512 encoded version:

Moves 512 bits of packed double-precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load a ZMM register from a float64 memory
location, to store the contents of a ZMM register into memory, or to move data between two ZMM registers. The
destination operand is updated according to the writemask.

VEX.256 encoded version:

Moves 256 bits of packed double-precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory
location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM
registers. Bits (MAX_VL-1:256) of the destination register are zeroed.

128-bit versions:

Moves 128 bits of packed double-precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory
location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two
XMM registers.

128-bit Legacy SSE version: Bits (MAX_VL-1:128) of the corresponding destination register remain unchanged.

When the source or destination operand is a memory operand, the operand may be unaligned on a 16-byte
boundary without causing a general-protection exception (#GP) to be generated.

VEX.128 and EVEX.128 encoded versions: Bits (MAX_VL-1:128) of the destination register are zeroed.

Operation 

VMOVUPD (EVEX encoded versions, register-copy form)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VMOVUPD (EVEX encoded versions, store-form)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE *DEST[i+63:i] remains unchanged* ; merging-masking
    FI;
ENDFOR;

VMOVUPD (EVEX encoded versions, load-form)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VMOVUPD (VEX.256 encoded version, load- and register-copy form)
DEST[255:0] ← SRC[255:0]
DEST[MAX_VL-1:256] ← 0

VMOVUPD (VEX.256 encoded version, store-form)
DEST[255:0] ← SRC[255:0]

VMOVUPD (VEX.128 encoded version)
DEST[127:0] ← SRC[127:0]
DEST[MAX_VL-1:128] ← 0

MOVUPD (128-bit load- and register-copy- form Legacy SSE version)
DEST[127:0] ← SRC[127:0]
DEST[MAX_VL-1:128] (Unmodified)

(V)MOVUPD (128-bit store-form version)
DEST[127:0] ← SRC[127:0]
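
The load-form pseudocode above reads each 64-bit element from an address that need not be aligned and then applies the writemask. A portable C sketch of that behavior is shown below. The function name and the zeroing flag are illustrative assumptions, and memcpy is used so that the model itself never performs a misaligned double access.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the EVEX load-form of VMOVUPD.
 * p       : source address, any byte alignment
 * kl      : KL from the pseudocode (2, 4 or 8 elements)
 * k       : writemask k1, bit j controls element j
 * zeroing : nonzero for zeroing-masking, zero for merging-masking */
static void vmovupd_mask_load_model(double *dst, const void *p, size_t kl,
                                    uint8_t k, int zeroing)
{
    const unsigned char *bytes = (const unsigned char *)p;

    for (size_t j = 0; j < kl; j++) {
        if ((k >> j) & 1u)
            memcpy(&dst[j], bytes + j * sizeof(double), sizeof(double));
        else if (zeroing)
            dst[j] = 0.0;               /* zeroing-masking */
        /* merging-masking: dst[j] remains unchanged */
    }
}
```

The model touches only the source elements whose mask bit is set. That is a property of this sketch and should not be read as a statement about the instruction's fault behavior.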

Intel C/C++ Compiler Intrinsic Equivalent 

VMOVUPD __m512d _mm512_loadu_pd( void * s);
VMOVUPD __m512d _mm512_mask_loadu_pd(__m512d a, __mmask8 k, void * s);
VMOVUPD __m512d _mm512_maskz_loadu_pd( __mmask8 k, void * s);
VMOVUPD void _mm512_storeu_pd( void * d, __m512d a);
VMOVUPD void _mm512_mask_storeu_pd( void * d, __mmask8 k, __m512d a);
VMOVUPD __m256d _mm256_mask_loadu_pd(__m256d s, __mmask8 k, void * m);
VMOVUPD __m256d _mm256_maskz_loadu_pd( __mmask8 k, void * m);
VMOVUPD void _mm256_mask_storeu_pd( void * d, __mmask8 k, __m256d a);
VMOVUPD __m128d _mm_mask_loadu_pd(__m128d s, __mmask8 k, void * m);
VMOVUPD __m128d _mm_maskz_loadu_pd( __mmask8 k, void * m);
VMOVUPD void _mm_mask_storeu_pd( void * d, __mmask8 k, __m128d a);
MOVUPD __m256d _mm256_loadu_pd (double * p);
MOVUPD void _mm256_storeu_pd( double *p, __m256d a);
MOVUPD __m128d _mm_loadu_pd (double * p);
MOVUPD void _mm_storeu_pd( double *p, __m128d a);

SIMD Floating-Point Exceptions 

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.
Note treatment of #AC varies; additionally
#UD If VEX.vvvv != 1111B.

EVEX-encoded instruction, see Exceptions Type E4.nb.

MOVUPS—Move Unaligned Packed Single-Precision Floating-Point Values


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F 10 /r MOVUPS xmm1, xmm2/m128 | RM | V/V | SSE | Move unaligned packed single-precision floating-point from xmm2/mem to xmm1.
0F 11 /r MOVUPS xmm2/m128, xmm1 | MR | V/V | SSE | Move unaligned packed single-precision floating-point from xmm1 to xmm2/mem.
VEX.128.0F.WIG 10 /r VMOVUPS xmm1, xmm2/m128 | RM | V/V | AVX | Move unaligned packed single-precision floating-point from xmm2/mem to xmm1.
VEX.128.0F.WIG 11 /r VMOVUPS xmm2/m128, xmm1 | MR | V/V | AVX | Move unaligned packed single-precision floating-point from xmm1 to xmm2/mem.
VEX.256.0F.WIG 10 /r VMOVUPS ymm1, ymm2/m256 | RM | V/V | AVX | Move unaligned packed single-precision floating-point from ymm2/mem to ymm1.
VEX.256.0F.WIG 11 /r VMOVUPS ymm2/m256, ymm1 | MR | V/V | AVX | Move unaligned packed single-precision floating-point from ymm1 to ymm2/mem.
EVEX.128.0F.W0 10 /r VMOVUPS xmm1 {k1}{z}, xmm2/m128 | FVM-RM | V/V | AVX512VL AVX512F | Move unaligned packed single-precision floating-point values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.0F.W0 10 /r VMOVUPS ymm1 {k1}{z}, ymm2/m256 | FVM-RM | V/V | AVX512VL AVX512F | Move unaligned packed single-precision floating-point values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.0F.W0 10 /r VMOVUPS zmm1 {k1}{z}, zmm2/m512 | FVM-RM | V/V | AVX512F | Move unaligned packed single-precision floating-point values from zmm2/m512 to zmm1 using writemask k1.
EVEX.128.0F.W0 11 /r VMOVUPS xmm2/m128 {k1}{z}, xmm1 | FVM-MR | V/V | AVX512VL AVX512F | Move unaligned packed single-precision floating-point values from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.0F.W0 11 /r VMOVUPS ymm2/m256 {k1}{z}, ymm1 | FVM-MR | V/V | AVX512VL AVX512F | Move unaligned packed single-precision floating-point values from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.0F.W0 11 /r VMOVUPS zmm2/m512 {k1}{z}, zmm1 | FVM-MR | V/V | AVX512F | Move unaligned packed single-precision floating-point values from zmm1 to zmm2/m512 using writemask k1.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
FVM-RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
FVM-MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA


Description 

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise instructions will #UD.

EVEX.512 encoded version:

Moves 512 bits of packed single-precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load a ZMM register from a 512-bit float32
memory location, to store the contents of a ZMM register into memory, or to move data between two ZMM registers.
The destination operand is updated according to the writemask.

VEX.256 and EVEX.256 encoded versions:

Moves 256 bits of packed single-precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory
location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM
registers. Bits (MAX_VL-1:256) of the destination register are zeroed.

128-bit versions:

Moves 128 bits of packed single-precision floating-point values from the source operand (second operand) to the
destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory
location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two
XMM registers.

128-bit Legacy SSE version: Bits (MAX_VL-1:128) of the corresponding destination register remain unchanged.

When the source or destination operand is a memory operand, the operand may be unaligned without causing a
general-protection exception (#GP) to be generated.

VEX.128 and EVEX.128 encoded versions: Bits (MAX_VL-1:128) of the destination register are zeroed.

Operation 

VMOVUPS (EVEX encoded versions, register-copy form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VMOVUPS (EVEX encoded versions, store-form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
    FI;
ENDFOR;

VMOVUPS (EVEX encoded versions, load-form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VMOVUPS (VEX.256 encoded version, load- and register-copy form)
DEST[255:0] ← SRC[255:0]
DEST[MAX_VL-1:256] ← 0

VMOVUPS (VEX.256 encoded version, store-form)
DEST[255:0] ← SRC[255:0]

VMOVUPS (VEX.128 encoded version)
DEST[127:0] ← SRC[127:0]
DEST[MAX_VL-1:128] ← 0

MOVUPS (128-bit load- and register-copy- form Legacy SSE version)
DEST[127:0] ← SRC[127:0]
DEST[MAX_VL-1:128] (Unmodified)

(V)MOVUPS (128-bit store-form version)
DEST[127:0] ← SRC[127:0]
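
In the store-form pseudocode above, a clear mask bit leaves the corresponding 32-bit memory element unchanged; no zeroing branch appears in that form. The following C sketch models this, assuming a hypothetical function name and a byte-addressed destination that may be unaligned.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the EVEX store-form of VMOVUPS.
 * p   : destination address, any byte alignment
 * k   : writemask k1, bit j controls element j
 * src : source register viewed as an array of floats
 * kl  : KL from the pseudocode (4, 8 or 16 elements) */
static void vmovups_mask_store_model(void *p, uint16_t k, const float *src,
                                     size_t kl)
{
    unsigned char *bytes = (unsigned char *)p;

    for (size_t j = 0; j < kl; j++) {
        if ((k >> j) & 1u)
            memcpy(bytes + j * sizeof(float), &src[j], sizeof(float));
        /* mask bit clear: memory element j remains unchanged */
    }
}
```

With k = 0x9 and KL = 4, only elements 0 and 3 of the destination are overwritten.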

Intel C/C++ Compiler Intrinsic Equivalent 

VMOVUPS __m512 _mm512_loadu_ps( void * s);
VMOVUPS __m512 _mm512_mask_loadu_ps(__m512 a, __mmask16 k, void * s);
VMOVUPS __m512 _mm512_maskz_loadu_ps( __mmask16 k, void * s);
VMOVUPS void _mm512_storeu_ps( void * d, __m512 a);
VMOVUPS void _mm512_mask_storeu_ps( void * d, __mmask16 k, __m512 a);
VMOVUPS __m256 _mm256_mask_loadu_ps(__m256 a, __mmask8 k, void * s);
VMOVUPS __m256 _mm256_maskz_loadu_ps( __mmask8 k, void * s);
VMOVUPS void _mm256_mask_storeu_ps( void * d, __mmask8 k, __m256 a);
VMOVUPS __m128 _mm_mask_loadu_ps(__m128 a, __mmask8 k, void * s);
VMOVUPS __m128 _mm_maskz_loadu_ps( __mmask8 k, void * s);
VMOVUPS void _mm_mask_storeu_ps( void * d, __mmask8 k, __m128 a);
MOVUPS __m256 _mm256_loadu_ps (float * p);
MOVUPS void _mm256_storeu_ps( float *p, __m256 a);
MOVUPS __m128 _mm_loadu_ps (float * p);
MOVUPS void _mm_storeu_ps( float *p, __m128 a);

SIMD Floating-Point Exceptions 

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.

Note treatment of #AC varies;

EVEX-encoded instruction, see Exceptions Type E4.nb.

#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.


MOVZX—Move with Zero-Extend


Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
0F B6 /r | MOVZX r16, r/m8 | RM | Valid | Valid | Move byte to word with zero-extension.
0F B6 /r | MOVZX r32, r/m8 | RM | Valid | Valid | Move byte to doubleword, zero-extension.
REX.W + 0F B6 /r | MOVZX r64, r/m8* | RM | Valid | N.E. | Move byte to quadword, zero-extension.
0F B7 /r | MOVZX r32, r/m16 | RM | Valid | Valid | Move word to doubleword, zero-extension.
REX.W + 0F B7 /r | MOVZX r64, r/m16 | RM | Valid | N.E. | Move word to quadword, zero-extension.

NOTES:

* In 64-bit mode, r/m8 cannot be encoded to access the following byte registers if the REX prefix is used: AH, BH, CH, DH.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA


Description 

Copies the contents of the source operand (register or memory location) to the destination operand (register) and
zero-extends the value. The size of the converted value depends on the operand-size attribute.

In 64-bit mode, the instruction's default operation size is 32 bits. Use of the REX.R prefix permits access to
additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64-bit operands. See the summary chart
at the beginning of this section for encoding data and limits.

Operation 

DEST ← ZeroExtend(SRC);
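
As an illustration of ZeroExtend, the following C sketch models two of the MOVZX forms with unsigned widening conversions, and contrasts them with sign extension. The function names are hypothetical and chosen only for this example; the model ignores register encoding and covers values only.

```c
#include <assert.h>
#include <stdint.h>

/* Model of MOVZX r32, r/m8: the upper 24 bits of the result are zero. */
static uint32_t movzx_r32_rm8(uint8_t src)
{
    return (uint32_t)src;
}

/* Model of MOVZX r64, r/m16: the upper 48 bits of the result are zero. */
static uint64_t movzx_r64_rm16(uint16_t src)
{
    return (uint64_t)src;
}

/* For contrast only: sign extension replicates bit 7 into the upper
 * bits, which MOVZX never does. */
static uint32_t sign_extend_8_to_32(uint8_t src)
{
    return (src & 0x80u) ? (0xFFFFFF00u | src) : (uint32_t)src;
}
```

For the source byte 80H, the zero-extended doubleword is 00000080H, whereas sign extension would give FFFFFF80H.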

Flags Affected 

None. 

Protected Mode Exceptions 

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

If the DS, ES, FS, or GS register contains a NULL segment selector. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used. 

Real-Address Mode Exceptions 

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#SS If a memory operand effective address is outside the SS segment limit. 

#UD If the LOCK prefix is used.

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made. 

#UD If the LOCK prefix is used. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 

#GP(0) If the memory address is in a non-canonical form. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used.


MPSADBW — Compute Multiple Packed Sums of Absolute Difference


Opcode | Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
66 0F 3A 42 /r ib | MPSADBW xmm1, xmm2/m128, imm8 | RMI | V/V | SSE4_1 | Sums absolute 8-bit integer difference of adjacent groups of 4 byte integers in xmm1 and xmm2/m128 and writes the results in xmm1. Starting offsets within xmm1 and xmm2/m128 are determined by imm8.
VEX.NDS.128.66.0F3A.WIG 42 /r ib | VMPSADBW xmm1, xmm2, xmm3/m128, imm8 | RVMI | V/V | AVX | Sums absolute 8-bit integer difference of adjacent groups of 4 byte integers in xmm2 and xmm3/m128 and writes the results in xmm1. Starting offsets within xmm2 and xmm3/m128 are determined by imm8.
VEX.NDS.256.66.0F3A.WIG 42 /r ib | VMPSADBW ymm1, ymm2, ymm3/m256, imm8 | RVMI | V/V | AVX2 | Sums absolute 8-bit integer difference of adjacent groups of 4 byte integers in ymm2 and ymm3/m256 and writes the results in ymm1. Starting offsets within ymm2 and ymm3/m256 are determined by imm8.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RMI | ModRM:reg (r, w) | ModRM:r/m (r) | imm8 | NA
RVMI | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | imm8


Description 

(V)MPSADBW calculates packed word results of sum-absolute-difference (SAD) of unsigned bytes from two blocks
of 32-bit dword elements, using two select fields in the immediate byte to select the offsets of the two blocks within
the first source operand and the second operand. Packed SAD word results are calculated within each 128-bit lane.
Each SAD word result is calculated between a stationary block_2 (whose offset within the second source operand
is selected by a two-bit select control, multiplied by 32 bits) and a sliding block_1 at consecutive byte-granular
positions within the first source operand. The offset of the first 32-bit block of block_1 is selectable using a one-bit
select control, multiplied by 32 bits.

128-bit Legacy SSE version: Imm8[1:0]*32 specifies the bit offset of block_2 within the second source operand.
Imm8[2]*32 specifies the initial bit offset of block_1 within the first source operand. The first source operand
and destination operand are the same. The first source and destination operands are XMM registers. The second
source operand is either an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the corresponding
YMM destination register remain unchanged. Bits 7:3 of the immediate byte are ignored.

VEX.128 encoded version: Imm8[1:0]*32 specifies the bit offset of block_2 within the second source operand.
Imm8[2]*32 specifies the initial bit offset of block_1 within the first source operand. The first source and
destination operands are XMM registers. The second source operand is either an XMM register or a 128-bit memory
location. Bits (VLMAX-1:128) of the corresponding YMM register are zeroed. Bits 7:3 of the immediate byte are ignored.

VEX.256 encoded version: The sum-absolute-difference (SAD) operation is repeated 8 times for VMPSADBW between
the same block_2 (fixed offset within the second source operand) and a variable block_1 (offset is shifted by 8 bits
for each SAD operation) in the first source operand. Each 16-bit result of eight SAD operations between block_2
and block_1 is written to the respective word in the lower 128 bits of the destination operand.

Additionally, VMPSADBW performs another eight SAD operations on block_4 of the second source operand and
block_3 of the first source operand. (Imm8[4:3]*32 + 128) specifies the bit offset of block_4 within the second
source operand. (Imm8[5]*32 + 128) specifies the initial bit offset of block_3 within the first source operand.
Each 16-bit result of eight SAD operations between block_4 and block_3 is written to the respective word in the
upper 128 bits of the destination operand.

The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory
location. The destination operand is a YMM register. Bits 7:6 of the immediate byte are ignored.

Note: If VMPSADBW is encoded with VEX.L= 1, an attempt to execute the instruction encoded with VEX.L= 1 will 
cause an #UD exception. 
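
The block selection and sliding-window behavior described above can be sketched as a scalar C reference model. This is a minimal sketch under stated assumptions, not a definition of the instruction: the function names mpsadbw_lane, mpsadbw128 and vmpsadbw256 are hypothetical, operands are passed as byte arrays in which element 0 holds bits 7:0, and results are returned as an array of 16-bit words in the same order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One 128-bit lane. blk1_sel is the one-bit select for the sliding block
 * in the first source; blk2_sel is the two-bit select for the stationary
 * block in the second source. Each select is scaled by 4 bytes (32 bits). */
static void mpsadbw_lane(uint16_t out[8], const uint8_t src1[16],
                         const uint8_t src2[16],
                         unsigned blk1_sel, unsigned blk2_sel)
{
    assert(blk1_sel < 2 && blk2_sel < 4);
    const uint8_t *b1 = src1 + 4 * blk1_sel;  /* 11-byte sliding window  */
    const uint8_t *b2 = src2 + 4 * blk2_sel;  /* 4-byte stationary block */
    for (int i = 0; i < 8; i++) {             /* eight SAD results       */
        unsigned sum = 0;
        for (int k = 0; k < 4; k++)
            sum += (unsigned)abs((int)b1[i + k] - (int)b2[k]);
        out[i] = (uint16_t)sum;               /* at most 4 * 255 = 1020  */
    }
}

/* 128-bit forms: imm8[2] selects block_1, imm8[1:0] selects block_2. */
static void mpsadbw128(uint16_t out[8], const uint8_t src1[16],
                       const uint8_t src2[16], uint8_t imm8)
{
    mpsadbw_lane(out, src1, src2, (imm8 >> 2) & 1u, imm8 & 3u);
}

/* VEX.256 form: the upper lane uses imm8[5] (block_3) and imm8[4:3]
 * (block_4), relative to bit 128 of each source. */
static void vmpsadbw256(uint16_t out[16], const uint8_t src1[32],
                        const uint8_t src2[32], uint8_t imm8)
{
    mpsadbw_lane(out, src1, src2, (imm8 >> 2) & 1u, imm8 & 3u);
    mpsadbw_lane(out + 8, src1 + 16, src2 + 16,
                 (imm8 >> 5) & 1u, (imm8 >> 3) & 3u);
}
```

The widest access in a lane is byte 4 + 7 + 3 = 14 of the first source, so the sliding window never leaves its 128-bit lane. The model does not represent the treatment of destination bits above the operated width.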


[Figure 4-5 placeholder: in the upper lane (bits 255:128) the source block offset is Imm8[4:3]*32+128; in the
lower lane (bits 127:0) it is Imm8[1:0]*32; each lane feeds the corresponding half of the destination.]

Figure 4-5. 256-bit VMPSADBW Operation

Operation

VMPSADBW (VEX.256 encoded version)

BLK2_OFFSET ← Imm8[1:0]*32
BLK1_OFFSET ← Imm8[2]*32
SRC1_BYTE0 ← SRC1[BLK1_OFFSET+7:BLK1_OFFSET]
SRC1_BYTE1 ← SRC1[BLK1_OFFSET+15:BLK1_OFFSET+8]
SRC1_BYTE2 ← SRC1[BLK1_OFFSET+23:BLK1_OFFSET+16]
SRC1_BYTE3 ← SRC1[BLK1_OFFSET+31:BLK1_OFFSET+24]
SRC1_BYTE4 ← SRC1[BLK1_OFFSET+39:BLK1_OFFSET+32]
SRC1_BYTE5 ← SRC1[BLK1_OFFSET+47:BLK1_OFFSET+40]
SRC1_BYTE6 ← SRC1[BLK1_OFFSET+55:BLK1_OFFSET+48]
SRC1_BYTE7 ← SRC1[BLK1_OFFSET+63:BLK1_OFFSET+56]
SRC1_BYTE8 ← SRC1[BLK1_OFFSET+71:BLK1_OFFSET+64]
SRC1_BYTE9 ← SRC1[BLK1_OFFSET+79:BLK1_OFFSET+72]
SRC1_BYTE10 ← SRC1[BLK1_OFFSET+87:BLK1_OFFSET+80]
SRC2_BYTE0 ← SRC2[BLK2_OFFSET+7:BLK2_OFFSET]
SRC2_BYTE1 ← SRC2[BLK2_OFFSET+15:BLK2_OFFSET+8]
SRC2_BYTE2 ← SRC2[BLK2_OFFSET+23:BLK2_OFFSET+16]
SRC2_BYTE3 ← SRC2[BLK2_OFFSET+31:BLK2_OFFSET+24]

TEMP0 ← ABS(SRC1_BYTE0 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE1 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE2 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE3 - SRC2_BYTE3)
DEST[15:0] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE1 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE2 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE3 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE4 - SRC2_BYTE3)
DEST[31:16] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE2 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE3 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE4 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE5 - SRC2_BYTE3)
DEST[47:32] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE3 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE4 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE5 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE6 - SRC2_BYTE3)
DEST[63:48] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE4 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE5 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE6 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE7 - SRC2_BYTE3)
DEST[79:64] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE5 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE6 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE7 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE8 - SRC2_BYTE3)
DEST[95:80] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE6 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE7 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE8 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE9 - SRC2_BYTE3)
DEST[111:96] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE7 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE8 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE9 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE10 - SRC2_BYTE3)
DEST[127:112] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

BLK2_OFFSET ← Imm8[4:3]*32 + 128
BLK1_OFFSET ← Imm8[5]*32 + 128
SRC1_BYTE0 ← SRC1[BLK1_OFFSET+7:BLK1_OFFSET]
SRC1_BYTE1 ← SRC1[BLK1_OFFSET+15:BLK1_OFFSET+8]
SRC1_BYTE2 ← SRC1[BLK1_OFFSET+23:BLK1_OFFSET+16]
SRC1_BYTE3 ← SRC1[BLK1_OFFSET+31:BLK1_OFFSET+24]
SRC1_BYTE4 ← SRC1[BLK1_OFFSET+39:BLK1_OFFSET+32]
SRC1_BYTE5 ← SRC1[BLK1_OFFSET+47:BLK1_OFFSET+40]
SRC1_BYTE6 ← SRC1[BLK1_OFFSET+55:BLK1_OFFSET+48]
SRC1_BYTE7 ← SRC1[BLK1_OFFSET+63:BLK1_OFFSET+56]
SRC1_BYTE8 ← SRC1[BLK1_OFFSET+71:BLK1_OFFSET+64]
SRC1_BYTE9 ← SRC1[BLK1_OFFSET+79:BLK1_OFFSET+72]
SRC1_BYTE10 ← SRC1[BLK1_OFFSET+87:BLK1_OFFSET+80]

SRC2_BYTE0 ← SRC2[BLK2_OFFSET+7:BLK2_OFFSET]
SRC2_BYTE1 ← SRC2[BLK2_OFFSET+15:BLK2_OFFSET+8]
SRC2_BYTE2 ← SRC2[BLK2_OFFSET+23:BLK2_OFFSET+16]
SRC2_BYTE3 ← SRC2[BLK2_OFFSET+31:BLK2_OFFSET+24]

TEMP0 ← ABS(SRC1_BYTE0 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE1 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE2 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE3 - SRC2_BYTE3)
DEST[143:128] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE1 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE2 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE3 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE4 - SRC2_BYTE3)
DEST[159:144] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE2 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE3 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE4 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE5 - SRC2_BYTE3)
DEST[175:160] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE3 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE4 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE5 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE6 - SRC2_BYTE3)
DEST[191:176] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE4 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE5 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE6 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE7 - SRC2_BYTE3)
DEST[207:192] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE5 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE6 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE7 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE8 - SRC2_BYTE3)
DEST[223:208] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE6 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE7 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE8 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE9 - SRC2_BYTE3)
DEST[239:224] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE7 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE8 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE9 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE10 - SRC2_BYTE3)
DEST[255:240] ← TEMP0 + TEMP1 + TEMP2 + TEMP3


VMPSADBW (VEX.128 encoded version)

BLK2_OFFSET ← Imm8[1:0]*32
BLK1_OFFSET ← Imm8[2]*32
SRC1_BYTE0 ← SRC1[BLK1_OFFSET+7:BLK1_OFFSET]
SRC1_BYTE1 ← SRC1[BLK1_OFFSET+15:BLK1_OFFSET+8]
SRC1_BYTE2 ← SRC1[BLK1_OFFSET+23:BLK1_OFFSET+16]
SRC1_BYTE3 ← SRC1[BLK1_OFFSET+31:BLK1_OFFSET+24]
SRC1_BYTE4 ← SRC1[BLK1_OFFSET+39:BLK1_OFFSET+32]
SRC1_BYTE5 ← SRC1[BLK1_OFFSET+47:BLK1_OFFSET+40]
SRC1_BYTE6 ← SRC1[BLK1_OFFSET+55:BLK1_OFFSET+48]
SRC1_BYTE7 ← SRC1[BLK1_OFFSET+63:BLK1_OFFSET+56]
SRC1_BYTE8 ← SRC1[BLK1_OFFSET+71:BLK1_OFFSET+64]
SRC1_BYTE9 ← SRC1[BLK1_OFFSET+79:BLK1_OFFSET+72]
SRC1_BYTE10 ← SRC1[BLK1_OFFSET+87:BLK1_OFFSET+80]

SRC2_BYTE0 ← SRC2[BLK2_OFFSET+7:BLK2_OFFSET]
SRC2_BYTE1 ← SRC2[BLK2_OFFSET+15:BLK2_OFFSET+8]
SRC2_BYTE2 ← SRC2[BLK2_OFFSET+23:BLK2_OFFSET+16]
SRC2_BYTE3 ← SRC2[BLK2_OFFSET+31:BLK2_OFFSET+24]

TEMP0 ← ABS(SRC1_BYTE0 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE1 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE2 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE3 - SRC2_BYTE3)
DEST[15:0] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE1 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE2 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE3 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE4 - SRC2_BYTE3)
DEST[31:16] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE2 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE3 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE4 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE5 - SRC2_BYTE3)
DEST[47:32] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE3 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE4 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE5 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE6 - SRC2_BYTE3)
DEST[63:48] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE4 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE5 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE6 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE7 - SRC2_BYTE3)
DEST[79:64] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE5 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE6 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE7 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE8 - SRC2_BYTE3)
DEST[95:80] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE6 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE7 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE8 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE9 - SRC2_BYTE3)
DEST[111:96] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE7 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE8 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE9 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE10 - SRC2_BYTE3)
DEST[127:112] ← TEMP0 + TEMP1 + TEMP2 + TEMP3
DEST[VLMAX-1:128] ← 0

MPSADBW (128-bit Legacy SSE version)

SRC_OFFSET ← imm8[1:0]*32
DEST_OFFSET ← imm8[2]*32
DEST_BYTE0 ← DEST[DEST_OFFSET+7:DEST_OFFSET]
DEST_BYTE1 ← DEST[DEST_OFFSET+15:DEST_OFFSET+8]
DEST_BYTE2 ← DEST[DEST_OFFSET+23:DEST_OFFSET+16]
DEST_BYTE3 ← DEST[DEST_OFFSET+31:DEST_OFFSET+24]
DEST_BYTE4 ← DEST[DEST_OFFSET+39:DEST_OFFSET+32]
DEST_BYTE5 ← DEST[DEST_OFFSET+47:DEST_OFFSET+40]
DEST_BYTE6 ← DEST[DEST_OFFSET+55:DEST_OFFSET+48]
DEST_BYTE7 ← DEST[DEST_OFFSET+63:DEST_OFFSET+56]
DEST_BYTE8 ← DEST[DEST_OFFSET+71:DEST_OFFSET+64]
DEST_BYTE9 ← DEST[DEST_OFFSET+79:DEST_OFFSET+72]
DEST_BYTE10 ← DEST[DEST_OFFSET+87:DEST_OFFSET+80]

SRC_BYTE0 ← SRC[SRC_OFFSET+7:SRC_OFFSET]
SRC_BYTE1 ← SRC[SRC_OFFSET+15:SRC_OFFSET+8]
SRC_BYTE2 ← SRC[SRC_OFFSET+23:SRC_OFFSET+16]
SRC_BYTE3 ← SRC[SRC_OFFSET+31:SRC_OFFSET+24]

TEMP0 ← ABS(DEST_BYTE0 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE1 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE2 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE3 - SRC_BYTE3)
DEST[15:0] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE1 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE2 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE3 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE4 - SRC_BYTE3)
DEST[31:16] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE2 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE3 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE4 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE5 - SRC_BYTE3)
DEST[47:32] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE3 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE4 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE5 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE6 - SRC_BYTE3)
DEST[63:48] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE4 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE5 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE6 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE7 - SRC_BYTE3)
DEST[79:64] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE5 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE6 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE7 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE8 - SRC_BYTE3)
DEST[95:80] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE6 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE7 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE8 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE9 - SRC_BYTE3)
DEST[111:96] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE7 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE8 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE9 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE10 - SRC_BYTE3)
DEST[127:112] ← TEMP0 + TEMP1 + TEMP2 + TEMP3
DEST[VLMAX-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent 

(V)MPSADBW: __m128i _mm_mpsadbw_epu8 (__m128i s1, __m128i s2, const int mask);

VMPSADBW: __m256i _mm256_mpsadbw_epu8 (__m256i s1, __m256i s2, const int mask);

Flags Affected 

None 

Other Exceptions 

See Exceptions Type 4; additionally
#UD If VEX.L = 1.


MUL—Unsigned Multiply


Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
F6 /4 | MUL r/m8 | M | Valid | Valid | Unsigned multiply (AX ← AL * r/m8).
REX + F6 /4 | MUL r/m8* | M | Valid | N.E. | Unsigned multiply (AX ← AL * r/m8).
F7 /4 | MUL r/m16 | M | Valid | Valid | Unsigned multiply (DX:AX ← AX * r/m16).
F7 /4 | MUL r/m32 | M | Valid | Valid | Unsigned multiply (EDX:EAX ← EAX * r/m32).
REX.W + F7 /4 | MUL r/m64 | M | Valid | N.E. | Unsigned multiply (RDX:RAX ← RAX * r/m64).

NOTES:

* In 64-bit mode, r/m8 cannot be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
M | ModRM:r/m (r) | NA | NA | NA


Description 

Performs an unsigned multiplication of the first operand (destination operand) and the second operand (source
operand) and stores the result in the destination operand. The destination operand is an implied operand located in
register AL, AX, EAX, or RAX (depending on the size of the operand); the source operand is located in a general-purpose
register or a memory location. The action of this instruction and the location of the result depend on the opcode
and the operand size as shown in Table 4-9.

The result is stored in register AX, register pair DX:AX, register pair EDX:EAX, or register pair RDX:RAX (depending
on the operand size), with the high-order bits of the product contained in register AH, DX, EDX, or RDX, respectively.
If the high-order bits of the product are 0, the CF and OF flags are cleared; otherwise, the flags are set.

In 64-bit mode, the instruction's default operation size is 32 bits. Use of the REX.R prefix permits access to
additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits.

See the summary chart at the beginning of this section for encoding data and limits.


Table 4-9. MUL Results 


Operand Size | Source 1 | Source 2 | Destination
Byte | AL | r/m8 | AX
Word | AX | r/m16 | DX:AX
Doubleword | EAX | r/m32 | EDX:EAX
Quadword | RAX | r/m64 | RDX:RAX

Operation

IF (Byte operation)
    THEN
        AX ← AL * SRC;
    ELSE (* Word, doubleword, or quadword operation *)
        IF OperandSize = 16
            THEN
                DX:AX ← AX * SRC;
            ELSE IF OperandSize = 32
                THEN EDX:EAX ← EAX * SRC; FI;
            ELSE (* OperandSize = 64 *)
                RDX:RAX ← RAX * SRC;
        FI;
FI;


Flags Affected 

The OF and CF flags are set to 0 if the upper half of the result is 0; otherwise, they are set to 1. The SF, ZF, AF, and 
PF flags are undefined. 
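
The widening product and the CF/OF rule can be illustrated with a small C model of the byte and doubleword forms. This is a sketch under stated assumptions: the struct and function names are hypothetical, and a single cf_of field stands for both flags because the text above sets them to the same value. The SF, ZF, AF, and PF flags, which are undefined, are not modeled.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint16_t ax; int cf_of; } mul8_result;
typedef struct { uint32_t eax; uint32_t edx; int cf_of; } mul32_result;

/* Byte form: AX <- AL * SRC; flags set when AH (upper half) is nonzero. */
static mul8_result mul8(uint8_t al, uint8_t src)
{
    mul8_result r;
    r.ax = (uint16_t)((unsigned)al * (unsigned)src);
    r.cf_of = (r.ax >> 8) != 0;
    return r;
}

/* Doubleword form: EDX:EAX <- EAX * SRC; flags set when EDX is nonzero. */
static mul32_result mul32(uint32_t eax, uint32_t src)
{
    uint64_t product = (uint64_t)eax * (uint64_t)src;  /* full 64-bit product */
    mul32_result r;
    r.eax = (uint32_t)product;
    r.edx = (uint32_t)(product >> 32);
    r.cf_of = (r.edx != 0);
    return r;
}
```

For example, FFFFFFFFH * 2 gives EDX:EAX = 00000001H:FFFFFFFEH with CF and OF set, while 3 * 4 gives EDX = 0 with both flags cleared.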

Protected Mode Exceptions 

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

If the DS, ES, FS, or GS register contains a NULL segment selector. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used. 

Real-Address Mode Exceptions 

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#SS If a memory operand effective address is outside the SS segment limit. 

#UD If the LOCK prefix is used. 

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made. 

#UD If the LOCK prefix is used. 


Compatibility Mode Exceptions 

Same exceptions as in protected mode. 


64-Bit Mode Exceptions 

#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 

#GP(0) If the memory address is in a non-canonical form. 

#PF(fault-code) If a page fault occurs. 


#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.


MULPD—Multiply Packed Double-Precision Floating-Point Values


Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 59 /r | MULPD xmm1, xmm2/m128 | RM | V/V | SSE2 | Multiply packed double-precision floating-point values in xmm2/m128 with xmm1 and store result in xmm1.
VEX.NDS.128.66.0F.WIG 59 /r | VMULPD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Multiply packed double-precision floating-point values in xmm3/m128 with xmm2 and store result in xmm1.
VEX.NDS.256.66.0F.WIG 59 /r | VMULPD ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX | Multiply packed double-precision floating-point values in ymm3/m256 with ymm2 and store result in ymm1.
EVEX.NDS.128.66.0F.W1 59 /r | VMULPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Multiply packed double-precision floating-point values from xmm3/m128/m64bcst to xmm2 and store result in xmm1.
EVEX.NDS.256.66.0F.W1 59 /r | VMULPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Multiply packed double-precision floating-point values from ymm3/m256/m64bcst to ymm2 and store result in ymm1.
EVEX.NDS.512.66.0F.W1 59 /r | VMULPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} | FV | V/V | AVX512F | Multiply packed double-precision floating-point values in zmm3/m512/m64bcst with zmm2 and store result in zmm1.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Multiplies packed double-precision floating-point values from the first source operand with corresponding values in
the second source operand, and stores the packed double-precision floating-point results in the destination
operand.

EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. Bits (MAX_VL-1:256) of the
corresponding destination ZMM register are zeroed.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128)
of the corresponding ZMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The
destination is not distinct from the first source XMM register and the upper bits (MAX_VL-1:128) of the corresponding
ZMM register destination are unmodified.

Operation

VMULPD (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1) AND SRC2 *is a register*
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN
                    DEST[i+63:i] ← SRC1[i+63:i] * SRC2[63:0]
                ELSE
                    DEST[i+63:i] ← SRC1[i+63:i] * SRC2[i+63:i]
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
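
The interaction of the writemask with the embedded broadcast (EVEX.b = 1 with a memory source) in the pseudocode above can be sketched in scalar C. This is an illustrative model only: the function name vmulpd_model and its parameters are assumptions for the example, the rounding-mode selection (SET_RM) is not modeled, and zeroing of bits MAX_VL-1:VL is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the EVEX VMULPD element loop.
 * kl        number of 64-bit elements (2, 4 or 8)
 * k1        writemask, bit j controls element j
 * zeroing   nonzero for zeroing-masking, zero for merging-masking
 * broadcast nonzero when src2[0] is broadcast to every element */
static void vmulpd_model(double *dest, const double *src1, const double *src2,
                         unsigned kl, uint8_t k1, int zeroing, int broadcast)
{
    assert(kl == 2 || kl == 4 || kl == 8);
    for (unsigned j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u) {
            double b = broadcast ? src2[0] : src2[j];
            dest[j] = src1[j] * b;
        } else if (zeroing) {
            dest[j] = 0.0;               /* zeroing-masking */
        }
        /* merging-masking: dest[j] remains unchanged */
    }
}
```

In the broadcast case only the first element of src2 is read, which mirrors the SRC2[63:0] term in the pseudocode.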


VMULPD (VEX.256 encoded version)

DEST[63:0] ← SRC1[63:0] * SRC2[63:0]
DEST[127:64] ← SRC1[127:64] * SRC2[127:64]
DEST[191:128] ← SRC1[191:128] * SRC2[191:128]
DEST[255:192] ← SRC1[255:192] * SRC2[255:192]
DEST[MAX_VL-1:256] ← 0;


VMULPD (VEX.128 encoded version)

DEST[63:0] ← SRC1[63:0] * SRC2[63:0]
DEST[127:64] ← SRC1[127:64] * SRC2[127:64]
DEST[MAX_VL-1:128] ← 0


MULPD (128-bit Legacy SSE version)

DEST[63:0] ← DEST[63:0] * SRC[63:0]
DEST[127:64] ← DEST[127:64] * SRC[127:64]
DEST[MAX_VL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VMULPD __m512d _mm512_mul_pd( __m512d a, __m512d b);
VMULPD __m512d _mm512_mask_mul_pd(__m512d s, __mmask8 k, __m512d a, __m512d b);
VMULPD __m512d _mm512_maskz_mul_pd( __mmask8 k, __m512d a, __m512d b);
VMULPD __m512d _mm512_mul_round_pd( __m512d a, __m512d b, int);
VMULPD __m512d _mm512_mask_mul_round_pd(__m512d s, __mmask8 k, __m512d a, __m512d b, int);
VMULPD __m512d _mm512_maskz_mul_round_pd( __mmask8 k, __m512d a, __m512d b, int);
VMULPD __m256d _mm256_mul_pd (__m256d a, __m256d b);
MULPD __m128d _mm_mul_pd (__m128d a, __m128d b);

SIMD Floating-Point Exceptions 

Overflow, Underflow, Invalid, Precision, Denormal 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 2. 

EVEX-encoded instruction, see Exceptions Type E2.


MULPS—Multiply Packed Single-Precision Floating-Point Values


Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F 59 /r | MULPS xmm1, xmm2/m128 | RM | V/V | SSE | Multiply packed single-precision floating-point values in xmm2/m128 with xmm1 and store result in xmm1.
VEX.NDS.128.0F.WIG 59 /r | VMULPS xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Multiply packed single-precision floating-point values in xmm3/m128 with xmm2 and store result in xmm1.
VEX.NDS.256.0F.WIG 59 /r | VMULPS ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX | Multiply packed single-precision floating-point values in ymm3/m256 with ymm2 and store result in ymm1.
EVEX.NDS.128.0F.W0 59 /r | VMULPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply packed single-precision floating-point values from xmm3/m128/m32bcst to xmm2 and store result in xmm1.
EVEX.NDS.256.0F.W0 59 /r | VMULPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply packed single-precision floating-point values from ymm3/m256/m32bcst to ymm2 and store result in ymm1.
EVEX.NDS.512.0F.W0 59 /r | VMULPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst {er} | FV | V/V | AVX512F | Multiply packed single-precision floating-point values in zmm3/m512/m32bcst with zmm2 and store result in zmm1.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Multiplies the packed single-precision floating-point values from the first source operand with the corresponding
values in the second source operand, and stores the packed single-precision floating-point results in the
destination operand.

EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. Bits (MAX_VL-1:256) of the
corresponding destination ZMM register are zeroed.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128)
of the corresponding ZMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The
destination is not distinct from the first source XMM register and the upper bits (MAX_VL-1:128) of the corresponding
ZMM register destination are unmodified.

Operation

VMULPS (EVEX encoded version)

(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1) AND SRC2 *is a register*
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN
                    DEST[i+31:i] ← SRC1[i+31:i] * SRC2[31:0]
                ELSE
                    DEST[i+31:i] ← SRC1[i+31:i] * SRC2[i+31:i]
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0


VMULPS (VEX.256 encoded version)

DEST[31:0] ← SRC1[31:0] * SRC2[31:0]
DEST[63:32] ← SRC1[63:32] * SRC2[63:32]
DEST[95:64] ← SRC1[95:64] * SRC2[95:64]
DEST[127:96] ← SRC1[127:96] * SRC2[127:96]
DEST[159:128] ← SRC1[159:128] * SRC2[159:128]
DEST[191:160] ← SRC1[191:160] * SRC2[191:160]
DEST[223:192] ← SRC1[223:192] * SRC2[223:192]
DEST[255:224] ← SRC1[255:224] * SRC2[255:224]
DEST[MAX_VL-1:256] ← 0;


VMULPS (VEX.128 encoded version)

DEST[31:0] ← SRC1[31:0] * SRC2[31:0]
DEST[63:32] ← SRC1[63:32] * SRC2[63:32]
DEST[95:64] ← SRC1[95:64] * SRC2[95:64]
DEST[127:96] ← SRC1[127:96] * SRC2[127:96]
DEST[MAX_VL-1:128] ← 0


MULPS (128-bit Legacy SSE version)

DEST[31:0] ← SRC1[31:0] * SRC2[31:0]
DEST[63:32] ← SRC1[63:32] * SRC2[63:32]
DEST[95:64] ← SRC1[95:64] * SRC2[95:64]
DEST[127:96] ← SRC1[127:96] * SRC2[127:96]
DEST[MAX_VL-1:128] (Unmodified)
Intel C/C++ Compiler Intrinsic Equivalent

VMULPS __m512 _mm512_mul_ps(__m512 a, __m512 b);
VMULPS __m512 _mm512_mask_mul_ps(__m512 s, __mmask16 k, __m512 a, __m512 b);
VMULPS __m512 _mm512_maskz_mul_ps(__mmask16 k, __m512 a, __m512 b);
VMULPS __m512 _mm512_mul_round_ps(__m512 a, __m512 b, int);
VMULPS __m512 _mm512_mask_mul_round_ps(__m512 s, __mmask16 k, __m512 a, __m512 b, int);
VMULPS __m512 _mm512_maskz_mul_round_ps(__mmask16 k, __m512 a, __m512 b, int);
VMULPS __m256 _mm256_mask_mul_ps(__m256 s, __mmask8 k, __m256 a, __m256 b);
VMULPS __m256 _mm256_maskz_mul_ps(__mmask8 k, __m256 a, __m256 b);
VMULPS __m128 _mm_mask_mul_ps(__m128 s, __mmask8 k, __m128 a, __m128 b);
VMULPS __m128 _mm_maskz_mul_ps(__mmask8 k, __m128 a, __m128 b);
VMULPS __m256 _mm256_mul_ps(__m256 a, __m256 b);
MULPS __m128 _mm_mul_ps(__m128 a, __m128 b);

SIMD Floating-Point Exceptions 

Overflow, Underflow, Invalid, Precision, Denormal 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 2. 

EVEX-encoded instruction, see Exceptions Type E2. 
MULSD—Multiply Scalar Double-Precision Floating-Point Value

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F2 0F 59 /r MULSD xmm1, xmm2/m64 | RM | V/V | SSE2 | Multiply the low double-precision floating-point value in xmm2/m64 by low double-precision floating-point value in xmm1.
VEX.NDS.128.F2.0F.WIG 59 /r VMULSD xmm1, xmm2, xmm3/m64 | RVM | V/V | AVX | Multiply the low double-precision floating-point value in xmm3/m64 by low double-precision floating-point value in xmm2.
EVEX.NDS.LIG.F2.0F.W1 59 /r VMULSD xmm1 {k1}{z}, xmm2, xmm3/m64 {er} | T1S | V/V | AVX512F | Multiply the low double-precision floating-point value in xmm3/m64 by low double-precision floating-point value in xmm2.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
T1S | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Multiplies the low double-precision floating-point value in the second source operand by the low double-precision
floating-point value in the first source operand, and stores the double-precision floating-point result in the
destination operand. The second source operand can be an XMM register or a 64-bit memory location. The first source
operand and the destination operand are XMM registers.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits
(MAX_VL-1:64) of the corresponding destination register remain unchanged.

VEX.128 and EVEX encoded version: The quadword at bits 127:64 of the destination operand is copied from the 
same bits of the first source operand. Bits (MAX_VL-1:128) of the destination register are zeroed. 

EVEX encoded version: The low quadword element of the destination operand is updated according to the 
writemask. 

Software should ensure VMULSD is encoded with VEX.L=0. Encoding VMULSD with VEX.L=1 may result in
unpredictable behavior across different processor generations.
Operation

VMULSD (EVEX encoded version)
IF (EVEX.b = 1) AND SRC2 *is a register*
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← SRC1[63:0] * SRC2[63:0]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] ← 0
        FI
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAX_VL-1:128] ← 0
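
As a rough illustration of the scalar case, the following C sketch models the low 128 bits of the destination as two doubles. The name vmulsd_emul and its parameters are assumptions made for this example only; rounding control and the zeroing of bits above 127 are not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: emulates EVEX-encoded VMULSD on the low 128 bits.
 * Element 0 is the low quadword, element 1 the high quadword. */
void vmulsd_emul(double dest[2], const double src1[2], const double src2[2],
                 unsigned k1, int use_mask, int zeroing)
{
    assert(dest != NULL && src1 != NULL && src2 != NULL);
    if (!use_mask || (k1 & 1u))
        dest[0] = src1[0] * src2[0];
    else if (zeroing)
        dest[0] = 0.0;            /* zeroing-masking */
    /* merging-masking leaves dest[0] unchanged */
    dest[1] = src1[1];            /* high quadword always comes from SRC1 */
}
```

Note that the high quadword is copied from the first source regardless of the mask bit, which matches the last two assignments of the pseudocode.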

VMULSD (VEX.128 encoded version)
DEST[63:0] ← SRC1[63:0] * SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[MAX_VL-1:128] ← 0

MULSD (128-bit Legacy SSE version)
DEST[63:0] ← DEST[63:0] * SRC[63:0]
DEST[MAX_VL-1:64] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent 

VMULSD __m128d _mm_mask_mul_sd(__m128d s, __mmask8 k, __m128d a, __m128d b);
VMULSD __m128d _mm_maskz_mul_sd(__mmask8 k, __m128d a, __m128d b);
VMULSD __m128d _mm_mul_round_sd(__m128d a, __m128d b, int);
VMULSD __m128d _mm_mask_mul_round_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, int);
VMULSD __m128d _mm_maskz_mul_round_sd(__mmask8 k, __m128d a, __m128d b, int);
MULSD __m128d _mm_mul_sd(__m128d a, __m128d b)

SIMD Floating-Point Exceptions 

Overflow, Underflow, Invalid, Precision, Denormal 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 3. 

EVEX-encoded instruction, see Exceptions Type E3. 
MULSS—Multiply Scalar Single-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F 59 /r MULSS xmm1, xmm2/m32 | RM | V/V | SSE | Multiply the low single-precision floating-point value in xmm2/m32 by the low single-precision floating-point value in xmm1.
VEX.NDS.128.F3.0F.WIG 59 /r VMULSS xmm1, xmm2, xmm3/m32 | RVM | V/V | AVX | Multiply the low single-precision floating-point value in xmm3/m32 by the low single-precision floating-point value in xmm2.
EVEX.NDS.LIG.F3.0F.W0 59 /r VMULSS xmm1 {k1}{z}, xmm2, xmm3/m32 {er} | T1S | V/V | AVX512F | Multiply the low single-precision floating-point value in xmm3/m32 by the low single-precision floating-point value in xmm2.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
T1S | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Multiplies the low single-precision floating-point value from the second source operand by the low single-precision
floating-point value in the first source operand, and stores the single-precision floating-point result in the
destination operand. The second source operand can be an XMM register or a 32-bit memory location. The first source
operand and the destination operand are XMM registers.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits
(MAX_VL-1:32) of the corresponding YMM destination register remain unchanged.

VEX.128 and EVEX encoded version: The first source operand is an XMM register encoded by VEX.vvvv. The three
high-order doublewords of the destination operand are copied from the first source operand. Bits (MAX_VL-1:128)
of the destination register are zeroed.

EVEX encoded version: The low doubleword element of the destination operand is updated according to the 
writemask. 

Software should ensure VMULSS is encoded with VEX.L=0. Encoding VMULSS with VEX.L=1 may result in
unpredictable behavior across different processor generations.
Operation

VMULSS (EVEX encoded version)
IF (EVEX.b = 1) AND SRC2 *is a register*
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← SRC1[31:0] * SRC2[31:0]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAX_VL-1:128] ← 0

VMULSS (VEX.128 encoded version)
DEST[31:0] ← SRC1[31:0] * SRC2[31:0]
DEST[127:32] ← SRC1[127:32]
DEST[MAX_VL-1:128] ← 0

MULSS (128-bit Legacy SSE version)
DEST[31:0] ← DEST[31:0] * SRC[31:0]
DEST[MAX_VL-1:32] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent 

VMULSS __m128 _mm_mask_mul_ss(__m128 s, __mmask8 k, __m128 a, __m128 b);
VMULSS __m128 _mm_maskz_mul_ss(__mmask8 k, __m128 a, __m128 b);
VMULSS __m128 _mm_mul_round_ss(__m128 a, __m128 b, int);
VMULSS __m128 _mm_mask_mul_round_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, int);
VMULSS __m128 _mm_maskz_mul_round_ss(__mmask8 k, __m128 a, __m128 b, int);
MULSS __m128 _mm_mul_ss(__m128 a, __m128 b)

SIMD Floating-Point Exceptions 

Underflow, Overflow, Invalid, Precision, Denormal 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 3. 

EVEX-encoded instruction, see Exceptions Type E3. 
MULX — Unsigned Multiply Without Affecting Flags

Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
VEX.NDD.LZ.F2.0F38.W0 F6 /r MULX r32a, r32b, r/m32 | RVM | V/V | BMI2 | Unsigned multiply of r/m32 with EDX without affecting arithmetic flags.
VEX.NDD.LZ.F2.0F38.W1 F6 /r MULX r64a, r64b, r/m64 | RVM | V/N.E. | BMI2 | Unsigned multiply of r/m64 with RDX without affecting arithmetic flags.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RVM | ModRM:reg (w) | VEX.vvvv (w) | ModRM:r/m (r) | RDX/EDX is implied 64/32 bits source


Description 

Performs an unsigned multiplication of the implicit source operand (EDX/RDX) and the specified source operand
(the third operand) and stores the low half of the result in the second destination (second operand), the high half
of the result in the first destination operand (first operand), without reading or writing the arithmetic flags. This
enables efficient programming where the software can interleave add with carry operations and multiplications.

If the first and second operands are identical, the register will contain the high half of the multiplication result.

This instruction is not supported in real mode and virtual-8086 mode. The operand size is always 32 bits if not in
64-bit mode. In 64-bit mode operand size 64 requires VEX.W1. VEX.W1 is ignored in non-64-bit modes. An
attempt to execute this instruction with VEX.L not equal to 0 will cause #UD.

Operation 

// DEST1: ModRM:reg
// DEST2: VEX.vvvv
IF (OperandSize = 32)
    SRC1 ← EDX;
    DEST2 ← (SRC1*SRC2)[31:0];
    DEST1 ← (SRC1*SRC2)[63:32];
ELSE IF (OperandSize = 64)
    SRC1 ← RDX;
    DEST2 ← (SRC1*SRC2)[63:0];
    DEST1 ← (SRC1*SRC2)[127:64];
FI
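
The widening multiply described above can be emulated in portable C, which may help when checking expected results by hand. This is a sketch under assumptions: the names mulx32_emul and mulx64_emul are hypothetical and are not the compiler intrinsics; the 64-bit form builds the 128-bit product from four 32x32 partial products because standard C has no 128-bit integer type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit form: returns the low half (DEST2), stores the high half (DEST1). */
uint32_t mulx32_emul(uint32_t edx, uint32_t src2, uint32_t *hi)
{
    assert(hi != NULL);
    uint64_t product = (uint64_t)edx * src2;
    *hi = (uint32_t)(product >> 32);
    return (uint32_t)product;
}

/* 64-bit form: same contract, using 32-bit limbs. */
uint64_t mulx64_emul(uint64_t rdx, uint64_t src2, uint64_t *hi)
{
    assert(hi != NULL);
    uint64_t a_lo = (uint32_t)rdx,  a_hi = rdx >> 32;
    uint64_t b_lo = (uint32_t)src2, b_hi = src2 >> 32;
    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;
    /* Sum of the three terms that land in bits 63:32; cannot overflow. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return (mid << 32) | (uint32_t)p0;
}
```

Neither function touches any flag-like state, which mirrors the property that distinguishes MULX from MUL.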

Flags Affected 

None 

Intel C/C++ Compiler Intrinsic Equivalent 

Auto-generated from high-level language when possible. 

unsigned int _mulx_u32(unsigned int a, unsigned int b, unsigned int * hi);

unsigned __int64 _mulx_u64(unsigned __int64 a, unsigned __int64 b, unsigned __int64 * hi);

SIMD Floating-Point Exceptions 

None 
Other Exceptions

See Section 2.5.1, "Exception Conditions for VEX-Encoded GPR Instructions", Table 2-29; additionally
#UD If VEX.W = 1.
MWAIT—Monitor Wait

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
0F 01 C9 | MWAIT | NP | Valid | Valid | A hint that allows the processor to stop instruction execution and enter an implementation-dependent optimized state until occurrence of a class of events.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
NP | NA | NA | NA | NA


Description 

MWAIT instruction provides hints to allow the processor to enter an implementation-dependent optimized state. 
There are two principal targeted usages: address-range monitor and advanced power management. Both usages 
of MWAIT require the use of the MONITOR instruction. 

CPUID.01H:ECX.MONITOR[bit 3] indicates the availability of MONITOR and MWAIT in the processor. When set,
MWAIT may be executed only at privilege level 0 (use at any other privilege level results in an invalid-opcode
exception). The operating system or system BIOS may disable this instruction by using the IA32_MISC_ENABLE
MSR; disabling MWAIT clears the CPUID feature flag and causes execution to generate an invalid-opcode
exception.

This instruction's operation is the same in non-64-bit modes and 64-bit mode. 

ECX specifies optional extensions for the MWAIT instruction. EAX may contain hints such as the preferred optimized 
state the processor should enter. The first processors to implement MWAIT supported only the zero value for EAX 
and ECX. Later processors allowed setting ECX[0] to enable masked interrupts as break events for MWAIT (see 
below). Software can use the CPUID instruction to determine the extensions and hints supported by the processor. 

MWAIT for Address Range Monitoring 

For address-range monitoring, the MWAIT instruction operates with the MONITOR instruction. The two instructions
allow the definition of an address at which to wait (MONITOR) and an implementation-dependent-optimized
operation to commence at the wait address (MWAIT). The execution of MWAIT is a hint to the processor that it can enter
an implementation-dependent-optimized state while waiting for an event or a store operation to the address range
armed by MONITOR.

The following cause the processor to exit the implementation-dependent-optimized state: a store to the address 
range armed by the MONITOR instruction, an NMI or SMI, a debug exception, a machine check exception, the 
BINIT# signal, the INIT# signal, and the RESET# signal. Other implementation-dependent events may also cause 
the processor to exit the implementation-dependent-optimized state. 

In addition, an external interrupt causes the processor to exit the implementation-dependent-optimized state
either (1) if the interrupt would be delivered to software (e.g., as it would be if HLT had been executed instead of
MWAIT); or (2) if ECX[0] = 1. Software can execute MWAIT with ECX[0] = 1 only if CPUID.05H:ECX[bit 1] = 1.
(Implementation-specific conditions may result in an interrupt causing the processor to exit the
implementation-dependent-optimized state even if interrupts are masked and ECX[0] = 0.)

Following exit from the implementation-dependent-optimized state, control passes to the instruction following the 
MWAIT instruction. A pending interrupt that is not masked (including an NMI or an SMI) may be delivered before 
execution of that instruction. Unlike the HLT instruction, the MWAIT instruction does not support a restart at the 
MWAIT instruction following the handling of an SMI. 

If the preceding MONITOR instruction did not successfully arm an address range or if the MONITOR instruction has
not been executed prior to executing MWAIT, then the processor will not enter the
implementation-dependent-optimized state. Execution will resume at the instruction following the MWAIT.
MWAIT for Power Management

MWAIT accepts a hint and optional extension to the processor that it can enter a specified target C state while 
waiting for an event or a store operation to the address range armed by MONITOR. Support for MWAIT extensions 
for power management is indicated by CPUID.05H:ECX[bit 0] reporting 1. 

EAX and ECX are used to communicate the additional information to the MWAIT instruction, such as the kind of 
optimized state the processor should enter. ECX specifies optional extensions for the MWAIT instruction. EAX may 
contain hints such as the preferred optimized state the processor should enter. Implementation-specific conditions 
may cause a processor to ignore the hint and enter a different optimized state. Future processor implementations 
may implement several optimized "waiting" states and will select among those states based on the hint argument. 

Table 4-10 describes the meaning of ECX and EAX registers for MWAIT extensions. 


Table 4-10. MWAIT Extension Register (ECX)

Bits | Description
0 | Treat interrupts as break events even if masked (e.g., even if EFLAGS.IF=0). May be set only if CPUID.05H:ECX[bit 1] = 1.
31:1 | Reserved


Table 4-11. MWAIT Hints Register (EAX)

Bits | Description
3:0 | Sub C-state within a C-state, indicated by bits [7:4]
7:4 | Target C-state*. Value of 0 means C1; 1 means C2 and so on. Value of 01111B means C0. Note: Target C states for MWAIT extensions are processor-specific C-states, not ACPI C-states
31:8 | Reserved
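
The bit layout of the two registers can be illustrated with a small C sketch that packs a hint value for EAX and an extension value for ECX. The helper names mwait_hint and mwait_ext are hypothetical, and the sketch only builds the register values; it does not execute MWAIT, which requires privilege level 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: builds the EAX hint for processor-specific C-state
 * number `cstate` (0..15) and sub C-state `sub` (0..15). */
uint32_t mwait_hint(unsigned cstate, unsigned sub)
{
    assert(cstate <= 15 && sub <= 15);
    /* Field value 0 means C1, 1 means C2, and so on; all ones means C0,
     * which is exactly what (0 - 1) & 0xF produces. */
    uint32_t target = (cstate - 1u) & 0xFu;
    return (target << 4) | sub;
}

/* Hypothetical helper: builds the ECX extension value. Bit 0 requests that
 * masked interrupts be treated as break events; bits 31:1 stay zero. */
uint32_t mwait_ext(int break_on_masked_interrupt)
{
    return break_on_masked_interrupt ? 1u : 0u;
}
```

Whether a given target C-state or the ECX[0] extension is actually supported still has to be checked through CPUID leaf 05H, as stated in the text.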


Note that if MWAIT is used to enter any of the C-states that are numerically higher than C1, a store to the address
range armed by the MONITOR instruction will cause the processor to exit MWAIT only if the store was originated by
other processor agents. A store from a non-processor agent might not cause the processor to exit MWAIT in such
cases.

For additional details of MWAIT extensions, see Chapter 14, "Power and Thermal Management," of Intel® 64 and
IA-32 Architectures Software Developer's Manual, Volume 3A.

Operation 

(* MWAIT takes the argument in EAX as a hint extension and is architected to take the argument in ECX as an instruction extension
MWAIT EAX, ECX *)
{
WHILE ( ("Monitor Hardware is in armed state")) {
    implementation_dependent_optimized_state(EAX, ECX); }
Set the state of Monitor Hardware as triggered;
}

Intel C/C++ Compiler Intrinsic Equivalent 

MWAIT: void _mm_mwait(unsigned extensions, unsigned hints) 


Example

MONITOR/MWAIT instruction pair must be coded in the same loop because execution of the MWAIT instruction will 
trigger the monitor hardware. It is not a proper usage to execute MONITOR once and then execute MWAIT in a 
loop. Setting up MONITOR without executing MWAIT has no adverse effects. 

Typically the MONITOR/MWAIT pair is used in a sequence, such as: 

EAX = Logical Address(Trigger)
ECX = 0 (* Hints *)
EDX = 0 (* Hints *)
IF ( !trigger_store_happened) {
    MONITOR EAX, ECX, EDX
    IF ( !trigger_store_happened ) {
        MWAIT EAX, ECX
    }
}

The above code sequence makes sure that a triggering store does not happen between the first check of the trigger
and the execution of the monitor instruction. Without the second check, that triggering store would go unnoticed.
Typical usage of MONITOR and MWAIT would have the above code sequence within a loop.

Numeric Exceptions 

None 

Protected Mode Exceptions

#GP(0) If ECX[31:1] ≠ 0.
       If ECX[0] = 1 and CPUID.05H:ECX[bit 1] = 0.
#UD    If CPUID.01H:ECX.MONITOR[bit 3] = 0.
       If current privilege level is not 0.

Real Address Mode Exceptions

#GP    If ECX[31:1] ≠ 0.
       If ECX[0] = 1 and CPUID.05H:ECX[bit 1] = 0.
#UD    If CPUID.01H:ECX.MONITOR[bit 3] = 0.

Virtual 8086 Mode Exceptions

#UD    The MWAIT instruction is not recognized in virtual-8086 mode (even if
       CPUID.01H:ECX.MONITOR[bit 3] = 1).

Compatibility Mode Exceptions

Same exceptions as in protected mode.

64-Bit Mode Exceptions

#GP(0) If RCX[63:1] ≠ 0.
       If RCX[0] = 1 and CPUID.05H:ECX[bit 1] = 0.
#UD    If the current privilege level is not 0.
       If CPUID.01H:ECX.MONITOR[bit 3] = 0.
NEG—Two's Complement Negation

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
F6 /3 | NEG r/m8 | M | Valid | Valid | Two's complement negate r/m8.
REX + F6 /3 | NEG r/m8* | M | Valid | N.E. | Two's complement negate r/m8.
F7 /3 | NEG r/m16 | M | Valid | Valid | Two's complement negate r/m16.
F7 /3 | NEG r/m32 | M | Valid | Valid | Two's complement negate r/m32.
REX.W + F7 /3 | NEG r/m64 | M | Valid | N.E. | Two's complement negate r/m64.

NOTES:

* In 64-bit mode, r/m8 can not be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
M | ModRM:r/m (r, w) | NA | NA | NA


Description 

Replaces the value of operand (the destination operand) with its two's complement. (This operation is equivalent 
to subtracting the operand from 0.) The destination operand is located in a general-purpose register or a memory 
location. 

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. 

In 64-bit mode, the instruction's default operation size is 32 bits. Using a REX prefix in the form of REX.R permits 
access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. See 
the summary chart at the beginning of this section for encoding data and limits. 

Operation 

IF DEST = 0
    THEN CF ← 0;
    ELSE CF ← 1;
FI;
DEST ← [– (DEST)]

Flags Affected 

The CF flag is set to 0 if the source operand is 0; otherwise it is set to 1. The OF, SF, ZF, AF, and PF flags are set
according to the result.
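
The flag behavior can be made concrete with a short C emulation of the 32-bit form. This is a sketch, not part of the reference: the type neg_flags and the function neg32_emul are hypothetical, only CF, ZF, SF, and OF are modeled, and the OF rule shown (set only when the operand is 80000000H, the one value whose negation is not representable) is stated here as an assumption about what "according to the result" means for a subtraction from 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the flags modeled by this sketch. */
typedef struct { int cf, zf, sf, of; } neg_flags;

/* Emulates NEG r/m32: returns 0 - dest and fills in the modeled flags. */
uint32_t neg32_emul(uint32_t dest, neg_flags *flags)
{
    assert(flags != NULL);
    uint32_t result = 0u - dest;            /* two's complement negation */
    flags->cf = (dest != 0);                /* CF = 0 only for operand 0 */
    flags->zf = (result == 0);
    flags->sf = (int)(result >> 31);
    flags->of = (dest == 0x80000000u);      /* most negative value overflows */
    return result;
}
```

For example, negating 1 yields FFFFFFFFH with CF and SF set, while negating 0 leaves the value unchanged and clears CF.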


Protected Mode Exceptions

#GP(0) If the destination is located in a non-writable segment.
       If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
       If the DS, ES, FS, or GS register contains a NULL segment selector.
#SS(0) If a memory operand effective address is outside the SS segment limit.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
       current privilege level is 3.
#UD    If the LOCK prefix is used but the destination is not a memory operand.


Real-Address Mode Exceptions

#GP    If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS    If a memory operand effective address is outside the SS segment limit.
#UD    If the LOCK prefix is used but the destination is not a memory operand.

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS(0) If a memory operand effective address is outside the SS segment limit.
#PF(fault-code) If a page fault occurs.
#AC(0) If alignment checking is enabled and an unaligned memory reference is made.
#UD    If the LOCK prefix is used but the destination is not a memory operand.


Compatibility Mode Exceptions 

Same as for protected mode exceptions. 


64-Bit Mode Exceptions 

#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 

#GP(0) If the memory address is in a non-canonical form. 

#PF(fault-code) For a page fault. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used but the destination is not a memory operand. 
NOP—No Operation

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
90 | NOP | NP | Valid | Valid | One byte no-operation instruction.
0F 1F /0 | NOP r/m16 | M | Valid | Valid | Multi-byte no-operation instruction.
0F 1F /0 | NOP r/m32 | M | Valid | Valid | Multi-byte no-operation instruction.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
NP | NA | NA | NA | NA
M | ModRM:r/m (r) | NA | NA | NA


Description 

This instruction performs no operation. It is a one-byte or multi-byte NOP that takes up space in the instruction 
stream but does not impact machine context, except for the EIP register. 

The multi-byte form of NOP is available on processors with model encoding:

• CPUID.01H.EAX[Bits 11:8] = 0110B or 1111B

The multi-byte NOP instruction does not alter the content of a register and will not issue a memory operation. The 
instruction's operation is the same in non-64-bit modes and 64-bit mode. 

Operation 

The one-byte NOP instruction is an alias mnemonic for the XCHG (E)AX, (E)AX instruction. 

The multi-byte NOP instruction performs no operation on supported processors and generates an undefined opcode
exception on processors that do not support the multi-byte NOP instruction.

The memory operand form of the instruction allows software to create a byte sequence of "no operation" as one 
instruction. For situations where multiple-byte NOPs are needed, the recommended operations (32-bit mode and 
64-bit mode) are: 


Table 4-12. Recommended Multi-Byte Sequence of NOP Instruction

Length | Assembly | Byte Sequence
2 bytes | 66 NOP | 66 90H
3 bytes | NOP DWORD ptr [EAX] | 0F 1F 00H
4 bytes | NOP DWORD ptr [EAX + 00H] | 0F 1F 40 00H
5 bytes | NOP DWORD ptr [EAX + EAX*1 + 00H] | 0F 1F 44 00 00H
6 bytes | 66 NOP DWORD ptr [EAX + EAX*1 + 00H] | 66 0F 1F 44 00 00H
7 bytes | NOP DWORD ptr [EAX + 00000000H] | 0F 1F 80 00 00 00 00H
8 bytes | NOP DWORD ptr [EAX + EAX*1 + 00000000H] | 0F 1F 84 00 00 00 00 00H
9 bytes | 66 NOP DWORD ptr [EAX + EAX*1 + 00000000H] | 66 0F 1F 84 00 00 00 00 00H
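
One way a code generator might use these sequences is to pad a gap with as few instructions as possible. The sketch below is an assumption-laden illustration: the table nop_seq and the function fill_nops are hypothetical names, the one-byte entry (90H) comes from the opcode table rather than Table 4-12, and the longest-first strategy is a design choice made for this example, not a recommendation taken from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* nop_seq[n] holds the n-byte NOP encoding for n = 1..9 (index 0 unused). */
static const unsigned char nop_seq[10][9] = {
    {0},
    {0x90},
    {0x66, 0x90},
    {0x0F, 0x1F, 0x00},
    {0x0F, 0x1F, 0x40, 0x00},
    {0x0F, 0x1F, 0x44, 0x00, 0x00},
    {0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00},
    {0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00},
    {0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
    {0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
};

/* Fills buf[0..len) with NOP encodings, longest first.
 * Returns the number of instructions written. */
size_t fill_nops(unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t count = 0;
    while (len > 0) {
        size_t n = len > 9 ? 9 : len;
        memcpy(buf, nop_seq[n], n);
        buf += n;
        len -= n;
        count++;
    }
    return count;
}
```

Padding 12 bytes this way emits one 9-byte NOP followed by one 3-byte NOP, so the processor decodes two instructions instead of twelve single-byte ones.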


Flags Affected 

None 

Exceptions (All Operating Modes) 

#UD If the LOCK prefix is used. 


NOT—One's Complement Negation

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
F6 /2 | NOT r/m8 | M | Valid | Valid | Reverse each bit of r/m8.
REX + F6 /2 | NOT r/m8* | M | Valid | N.E. | Reverse each bit of r/m8.
F7 /2 | NOT r/m16 | M | Valid | Valid | Reverse each bit of r/m16.
F7 /2 | NOT r/m32 | M | Valid | Valid | Reverse each bit of r/m32.
REX.W + F7 /2 | NOT r/m64 | M | Valid | N.E. | Reverse each bit of r/m64.


NOTES: 

* In 64-bit mode, r/m8 can not be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH. 


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
M | ModRM:r/m (r, w) | NA | NA | NA


Description 

Performs a bitwise NOT operation (each 1 is set to 0, and each 0 is set to 1) on the destination operand and stores 
the result in the destination operand location. The destination operand can be a register or a memory location. 

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. 

In 64-bit mode, the instruction's default operation size is 32 bits. Using a REX prefix in the form of REX.R permits 
access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. See 
the summary chart at the beginning of this section for encoding data and limits. 

Operation 

DEST ← NOT DEST;

Flags Affected 

None 

Protected Mode Exceptions 

#GP(0) If the destination operand points to a non-writable segment. 

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

If the DS, ES, FS, or GS register contains a NULL segment selector. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used but the destination is not a memory operand. 

Real-Address Mode Exceptions 

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#SS If a memory operand effective address is outside the SS segment limit. 

#UD If the LOCK prefix is used but the destination is not a memory operand. 
Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made. 

#UD If the LOCK prefix is used but the destination is not a memory operand. 

Compatibility Mode Exceptions 

Same as for protected mode exceptions. 

64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 

#GP(0) If the memory address is in a non-canonical form. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used but the destination is not a memory operand. 


OR—Logical Inclusive OR

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
0C ib | OR AL, imm8 | I | Valid | Valid | AL OR imm8.
0D iw | OR AX, imm16 | I | Valid | Valid | AX OR imm16.
0D id | OR EAX, imm32 | I | Valid | Valid | EAX OR imm32.
REX.W + 0D id | OR RAX, imm32 | I | Valid | N.E. | RAX OR imm32 (sign-extended).
80 /1 ib | OR r/m8, imm8 | MI | Valid | Valid | r/m8 OR imm8.
REX + 80 /1 ib | OR r/m8*, imm8 | MI | Valid | N.E. | r/m8 OR imm8.
81 /1 iw | OR r/m16, imm16 | MI | Valid | Valid | r/m16 OR imm16.
81 /1 id | OR r/m32, imm32 | MI | Valid | Valid | r/m32 OR imm32.
REX.W + 81 /1 id | OR r/m64, imm32 | MI | Valid | N.E. | r/m64 OR imm32 (sign-extended).
83 /1 ib | OR r/m16, imm8 | MI | Valid | Valid | r/m16 OR imm8 (sign-extended).
83 /1 ib | OR r/m32, imm8 | MI | Valid | Valid | r/m32 OR imm8 (sign-extended).
REX.W + 83 /1 ib | OR r/m64, imm8 | MI | Valid | N.E. | r/m64 OR imm8 (sign-extended).
08 /r | OR r/m8, r8 | MR | Valid | Valid | r/m8 OR r8.
REX + 08 /r | OR r/m8*, r8* | MR | Valid | N.E. | r/m8 OR r8.
09 /r | OR r/m16, r16 | MR | Valid | Valid | r/m16 OR r16.
09 /r | OR r/m32, r32 | MR | Valid | Valid | r/m32 OR r32.
REX.W + 09 /r | OR r/m64, r64 | MR | Valid | N.E. | r/m64 OR r64.
0A /r | OR r8, r/m8 | RM | Valid | Valid | r8 OR r/m8.
REX + 0A /r | OR r8*, r/m8* | RM | Valid | N.E. | r8 OR r/m8.
0B /r | OR r16, r/m16 | RM | Valid | Valid | r16 OR r/m16.
0B /r | OR r32, r/m32 | RM | Valid | Valid | r32 OR r/m32.
REX.W + 0B /r | OR r64, r/m64 | RM | Valid | N.E. | r64 OR r/m64.


NOTES: 

* In 64-bit mode, r/m8 can not be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
I | AL/AX/EAX/RAX | imm8/16/32 | NA | NA
MI | ModRM:r/m (r, w) | imm8/16/32 | NA | NA
MR | ModRM:r/m (r, w) | ModRM:reg (r) | NA | NA
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA


Description 

Performs a bitwise inclusive OR operation between the destination (first) and source (second) operands and stores 
the result in the destination operand location. The source operand can be an immediate, a register, or a memory 
location; the destination operand can be a register or a memory location. (However, two memory operands cannot 
be used in one instruction.) Each bit of the result of the OR instruction is set to 0 if both corresponding bits of the 
first and second operands are 0; otherwise, each bit is set to 1. 

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. 
In 64-bit mode, the instruction's default operation size is 32 bits. Using a REX prefix in the form of REX.R permits
access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. See 
the summary chart at the beginning of this section for encoding data and limits. 

Operation 

DEST ← DEST OR SRC;

Flags Affected 

The OF and CF flags are cleared; the SF, ZF, and PF flags are set according to the result. The state of the AF flag is 
undefined. 
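
The flag rules above can be illustrated with a small C emulation of the 8-bit form. The type or_flags and the function or8_emul are hypothetical names used only for this sketch; AF is omitted because its state is undefined, and PF is computed as even parity of the result byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the flags modeled by this sketch. */
typedef struct { int cf, of, sf, zf, pf; } or_flags;

/* Emulates OR r/m8, r8: returns dest | src and fills in the flags. */
uint8_t or8_emul(uint8_t dest, uint8_t src, or_flags *flags)
{
    assert(flags != NULL);
    uint8_t result = (uint8_t)(dest | src);
    flags->cf = 0;                      /* always cleared */
    flags->of = 0;                      /* always cleared */
    flags->sf = result >> 7;
    flags->zf = (result == 0);
    /* PF is 1 when the result byte contains an even number of 1 bits. */
    unsigned ones = 0;
    for (unsigned b = result; b != 0; b >>= 1)
        ones += b & 1u;
    flags->pf = (ones % 2 == 0);
    return result;
}
```

Because CF and OF are unconditionally cleared, code that relies on a carry from an earlier instruction has to consume it before executing OR.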


Protected Mode Exceptions

#GP(0)             If the destination operand points to a non-writable segment.
                   If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
                   If the DS, ES, FS, or GS register contains a NULL segment selector.
#SS(0)             If a memory operand effective address is outside the SS segment limit.
#PF(fault-code)    If a page fault occurs.
#AC(0)             If alignment checking is enabled and an unaligned memory reference is made while the
                   current privilege level is 3.
#UD                If the LOCK prefix is used but the destination is not a memory operand.


Real-Address Mode Exceptions

#GP                If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS                If a memory operand effective address is outside the SS segment limit.
#UD                If the LOCK prefix is used but the destination is not a memory operand.


Virtual-8086 Mode Exceptions

#GP(0)             If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS(0)             If a memory operand effective address is outside the SS segment limit.
#PF(fault-code)    If a page fault occurs.
#AC(0)             If alignment checking is enabled and an unaligned memory reference is made.
#UD                If the LOCK prefix is used but the destination is not a memory operand.


Compatibility Mode Exceptions 

Same as for protected mode exceptions. 


64-Bit Mode Exceptions

#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
#PF(fault-code)    If a page fault occurs.
#AC(0)             If alignment checking is enabled and an unaligned memory reference is made while the
                   current privilege level is 3.
#UD                If the LOCK prefix is used but the destination is not a memory operand.



ORPD—Bitwise Logical OR of Packed Double Precision Floating-Point Values 


Opcode/ 

Instruction 

Op/ 

En

64/32 
bit Mode 
Support 

CPUID 

Feature 

Flag 

Description 

66 0F 56 /r

ORPD xmm1, xmm2/m128

RM 

V/V 

SSE2 

Return the bitwise logical OR of packed double-precision 
floating-point values in xmm1 and xmm2/mem.

VEX.NDS.128.66.0F 56 /r

VORPD xmm1, xmm2, xmm3/m128

RVM

V/V

AVX 

Return the bitwise logical OR of packed double-precision 
floating-point values in xmm2 and xmm3/mem. 

VEX.NDS.256.66.0F 56 /r 

VORPD ymm1, ymm2, ymm3/m256

RVM 

V/V 

AVX 

Return the bitwise logical OR of packed double-precision 
floating-point values in ymm2 and ymm3/mem. 

EVEX.NDS.128.66.0F.W1 56 /r

VORPD xmm1 {k1}{z}, xmm2,
xmm3/m128/m64bcst

FV

V/V

AVX512VL

AVX512DQ

Return the bitwise logical OR of packed double-precision
floating-point values in xmm2 and xmm3/m128/m64bcst
subject to writemask k1.

EVEX.NDS.256.66.0F.W1 56 /r

VORPD ymm1 {k1}{z}, ymm2,
ymm3/m256/m64bcst

FV

V/V

AVX512VL

AVX512DQ

Return the bitwise logical OR of packed double-precision
floating-point values in ymm2 and ymm3/m256/m64bcst
subject to writemask k1.

EVEX.NDS.512.66.0F.W1 56 /r

VORPD zmm1 {k1}{z}, zmm2,
zmm3/m512/m64bcst

FV

V/V

AVX512DQ

Return the bitwise logical OR of packed double-precision
floating-point values in zmm2 and zmm3/m512/m64bcst
subject to writemask k1.


Instruction Operand Encoding 


Op/En   Operand 1          Operand 2       Operand 3       Operand 4
RM      ModRM:reg (r, w)   ModRM:r/m (r)   NA              NA
RVM     ModRM:reg (w)      VEX.vvvv (r)    ModRM:r/m (r)   NA
FV      ModRM:reg (w)      EVEX.vvvv (r)   ModRM:r/m (r)   NA


Description 

Performs a bitwise logical OR of the two, four or eight packed double-precision floating-point values from the first 
source operand and the second source operand, and stores the result in the destination operand. 
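
Because the OR is applied to the raw bit patterns rather than to numeric values, a common use is to force the sign bit of each element. The following portable C sketch models the packed operation lane by lane; the function names are hypothetical, and it assumes `double` uses the IEEE 754 binary64 layout, in which -0.0 has only the sign bit set.

```c
#include <stdint.h>
#include <string.h>

/* Bitwise OR of two doubles, treating each as a raw 64-bit pattern
   (one lane of the packed operation). Assumes IEEE 754 binary64. */
static double or_lane_pd(double a, double b)
{
    uint64_t ua, ub;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    ua |= ub;
    memcpy(&a, &ua, sizeof a);
    return a;
}

/* Packed form over n lanes: dest[i] = src1[i] OR src2[i]. */
static void or_packed_pd(double *dest, const double *src1,
                         const double *src2, int n)
{
    for (int i = 0; i < n; i++)
        dest[i] = or_lane_pd(src1[i], src2[i]);
}
```

ORing every lane with -0.0 sets the sign bit, so {1.5, -2.0} becomes {-1.5, -2.0}.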

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a
64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with
writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register 
or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAX_VL-1:256) of the 
corresponding ZMM register destination are zeroed. 

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM 
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128) 
of the corresponding ZMM register destination are zeroed. 

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The
destination is not distinct from the first source XMM register and the upper bits (MAX_VL-1:128) of the corresponding
register destination are unmodified.

Operation

VORPD (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b == 1) AND (SRC2 *is memory*)
                THEN
                    DEST[i+63:i] ← SRC1[i+63:i] BITWISE OR SRC2[63:0]
                ELSE
                    DEST[i+63:i] ← SRC1[i+63:i] BITWISE OR SRC2[i+63:i]
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
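
The EVEX loop can be followed more easily in a scalar C model. This is a sketch under stated assumptions, not the processor's implementation: `vorpd_model` and its parameters are hypothetical, the `broadcast` flag stands in for the condition (EVEX.b == 1) AND (SRC2 is memory), the "no writemask" case is represented by passing an all-ones mask, and the final zeroing of DEST[MAX_VL-1:VL] is not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Scalar model of the EVEX VORPD loop for kl 64-bit lanes.
   'broadcast' stands in for (EVEX.b == 1) AND (SRC2 is memory);
   'zeroing' selects zeroing-masking instead of merging-masking. */
static void vorpd_model(uint64_t *dest, const uint64_t *src1,
                        const uint64_t *src2, int kl, uint8_t k1,
                        bool broadcast, bool zeroing)
{
    for (int j = 0; j < kl; j++) {
        if ((k1 >> j) & 1) {
            /* With broadcast, every lane uses element 0 of SRC2. */
            uint64_t s2 = broadcast ? src2[0] : src2[j];
            dest[j] = src1[j] | s2;
        } else if (zeroing) {
            dest[j] = 0;
        }
        /* merging-masking: dest[j] is left unchanged */
    }
}
```

With mask 0x1 and merging-masking, only lane 0 is written and lane 1 keeps its previous contents; with zeroing-masking, a lane whose mask bit is clear is set to 0 instead.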

VORPD (VEX.256 encoded version)

DEST[63:0] ← SRC1[63:0] BITWISE OR SRC2[63:0]
DEST[127:64] ← SRC1[127:64] BITWISE OR SRC2[127:64]
DEST[191:128] ← SRC1[191:128] BITWISE OR SRC2[191:128]
DEST[255:192] ← SRC1[255:192] BITWISE OR SRC2[255:192]
DEST[MAX_VL-1:256] ← 0

VORPD (VEX.128 encoded version)

DEST[63:0] ← SRC1[63:0] BITWISE OR SRC2[63:0]
DEST[127:64] ← SRC1[127:64] BITWISE OR SRC2[127:64]
DEST[MAX_VL-1:128] ← 0

ORPD (128-bit Legacy SSE version)

DEST[63:0] ← DEST[63:0] BITWISE OR SRC[63:0]
DEST[127:64] ← DEST[127:64] BITWISE OR SRC[127:64]
DEST[MAX_VL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VORPD __m512d _mm512_or_pd (__m512d a, __m512d b);
VORPD __m512d _mm512_mask_or_pd (__m512d s, __mmask8 k, __m512d a, __m512d b);
VORPD __m512d _mm512_maskz_or_pd (__mmask8 k, __m512d a, __m512d b);
VORPD __m256d _mm256_mask_or_pd (__m256d s, __mmask8 k, __m256d a, __m256d b);
VORPD __m256d _mm256_maskz_or_pd (__mmask8 k, __m256d a, __m256d b);
VORPD __m128d _mm_mask_or_pd (__m128d s, __mmask8 k, __m128d a, __m128d b);
VORPD __m128d _mm_maskz_or_pd (__mmask8 k, __m128d a, __m128d b);
VORPD __m256d _mm256_or_pd (__m256d a, __m256d b);
ORPD __m128d _mm_or_pd (__m128d a, __m128d b);

SIMD Floating-Point Exceptions

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4.
EVEX-encoded instruction, see Exceptions Type E4.

ORPS—Bitwise Logical OR of Packed Single Precision Floating-Point Values


Opcode/ 

Instruction 

Op/ 

En 

64/32 
bit Mode 
Support 

CPUID 

Feature 

Flag 

Description 

0F 56 /r

ORPS xmm1, xmm2/m128

RM 

V/V 

SSE 

Return the bitwise logical OR of packed single-precision 
floating-point values in xmm1 and xmm2/mem.

VEX.NDS.128.0F 56 /r 

VORPS xmm1, xmm2, xmm3/m128

RVM

V/V

AVX 

Return the bitwise logical OR of packed single-precision 
floating-point values in xmm2 and xmm3/mem. 

VEX.NDS.256.0F 56 /r 

VORPS ymm1, ymm2, ymm3/m256

RVM 

V/V 

AVX 

Return the bitwise logical OR of packed single-precision 
floating-point values in ymm2 and ymm3/mem. 

EVEX.NDS.128.0F.W0 56 /r 

VORPS xmm1 {k1}{z}, xmm2,
xmm3/m128/m32bcst

FV

V/V

AVX512VL

AVX512DQ

Return the bitwise logical OR of packed single-precision
floating-point values in xmm2 and xmm3/m128/m32bcst
subject to writemask k1.

EVEX.NDS.256.0F.W0 56 /r

VORPS ymm1 {k1}{z}, ymm2,
ymm3/m256/m32bcst

FV

V/V

AVX512VL

AVX512DQ

Return the bitwise logical OR of packed single-precision
floating-point values in ymm2 and ymm3/m256/m32bcst
subject to writemask k1.

EVEX.NDS.512.0F.W0 56 /r

VORPS zmm1 {k1}{z}, zmm2,
zmm3/m512/m32bcst

FV

V/V

AVX512DQ

Return the bitwise logical OR of packed single-precision
floating-point values in zmm2 and zmm3/m512/m32bcst
subject to writemask k1.


Instruction Operand Encoding 


Op/En   Operand 1          Operand 2       Operand 3       Operand 4
RM      ModRM:reg (r, w)   ModRM:r/m (r)   NA              NA
RVM     ModRM:reg (w)      VEX.vvvv (r)    ModRM:r/m (r)   NA
FV      ModRM:reg (w)      EVEX.vvvv (r)   ModRM:r/m (r)   NA


Description 

Performs a bitwise logical OR of the four, eight or sixteen packed single-precision floating-point values from the
first source operand and the second source operand, and stores the result in the destination operand.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a
32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with
writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register 
or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAX_VL-1:256) of the 
corresponding ZMM register destination are zeroed. 

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM 
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128) 
of the corresponding ZMM register destination are zeroed. 

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The
destination is not distinct from the first source XMM register and the upper bits (MAX_VL-1:128) of the corresponding
register destination are unmodified.

Operation

VORPS (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b == 1) AND (SRC2 *is memory*)
                THEN
                    DEST[i+31:i] ← SRC1[i+31:i] BITWISE OR SRC2[31:0]
                ELSE
                    DEST[i+31:i] ← SRC1[i+31:i] BITWISE OR SRC2[i+31:i]
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0


VORPS (VEX.256 encoded version)

DEST[31:0] ← SRC1[31:0] BITWISE OR SRC2[31:0]
DEST[63:32] ← SRC1[63:32] BITWISE OR SRC2[63:32]
DEST[95:64] ← SRC1[95:64] BITWISE OR SRC2[95:64]
DEST[127:96] ← SRC1[127:96] BITWISE OR SRC2[127:96]
DEST[159:128] ← SRC1[159:128] BITWISE OR SRC2[159:128]
DEST[191:160] ← SRC1[191:160] BITWISE OR SRC2[191:160]
DEST[223:192] ← SRC1[223:192] BITWISE OR SRC2[223:192]
DEST[255:224] ← SRC1[255:224] BITWISE OR SRC2[255:224]
DEST[MAX_VL-1:256] ← 0

VORPS (VEX.128 encoded version)

DEST[31:0] ← SRC1[31:0] BITWISE OR SRC2[31:0]
DEST[63:32] ← SRC1[63:32] BITWISE OR SRC2[63:32]
DEST[95:64] ← SRC1[95:64] BITWISE OR SRC2[95:64]
DEST[127:96] ← SRC1[127:96] BITWISE OR SRC2[127:96]
DEST[MAX_VL-1:128] ← 0

ORPS (128-bit Legacy SSE version)

DEST[31:0] ← SRC1[31:0] BITWISE OR SRC2[31:0]
DEST[63:32] ← SRC1[63:32] BITWISE OR SRC2[63:32]
DEST[95:64] ← SRC1[95:64] BITWISE OR SRC2[95:64]
DEST[127:96] ← SRC1[127:96] BITWISE OR SRC2[127:96]
DEST[MAX_VL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VORPS __m512 _mm512_or_ps (__m512 a, __m512 b);
VORPS __m512 _mm512_mask_or_ps (__m512 s, __mmask16 k, __m512 a, __m512 b);
VORPS __m512 _mm512_maskz_or_ps (__mmask16 k, __m512 a, __m512 b);
VORPS __m256 _mm256_mask_or_ps (__m256 s, __mmask8 k, __m256 a, __m256 b);
VORPS __m256 _mm256_maskz_or_ps (__mmask8 k, __m256 a, __m256 b);
VORPS __m128 _mm_mask_or_ps (__m128 s, __mmask8 k, __m128 a, __m128 b);
VORPS __m128 _mm_maskz_or_ps (__mmask8 k, __m128 a, __m128 b);
VORPS __m256 _mm256_or_ps (__m256 a, __m256 b);
ORPS __m128 _mm_or_ps (__m128 a, __m128 b);

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded instruction, see Exceptions Type E4.

OUT—Output to Port


Opcode*   Instruction      Op/En   64-Bit Mode   Compat/Leg Mode   Description
E6 ib     OUT imm8, AL     I       Valid         Valid             Output byte in AL to I/O port address imm8.
E7 ib     OUT imm8, AX     I       Valid         Valid             Output word in AX to I/O port address imm8.
E7 ib     OUT imm8, EAX    I       Valid         Valid             Output doubleword in EAX to I/O port address imm8.
EE        OUT DX, AL       NP      Valid         Valid             Output byte in AL to I/O port address in DX.
EF        OUT DX, AX       NP      Valid         Valid             Output word in AX to I/O port address in DX.
EF        OUT DX, EAX      NP      Valid         Valid             Output doubleword in EAX to I/O port address in DX.


NOTES: 

* See IA-32 Architecture Compatibility section below. 


Instruction Operand Encoding 


Op/En   Operand 1   Operand 2   Operand 3   Operand 4
I       imm8        NA          NA          NA
NP      NA          NA          NA          NA


Description 

Copies the value from the second operand (source operand) to the I/O port specified with the destination operand 
(first operand). The source operand can be register AL, AX, or EAX, depending on the size of the port being 
accessed (8, 16, or 32 bits, respectively); the destination operand can be a byte-immediate or the DX register. 
Using a byte immediate allows I/O port addresses 0 to 255 to be accessed; using the DX register as the destination
operand allows I/O ports from 0 to 65,535 to be accessed.

The size of the I/O port being accessed is determined by the opcode for an 8-bit I/O port or by the operand-size 
attribute of the instruction for a 16- or 32-bit I/O port. 

At the machine code level, I/O instructions are shorter when accessing 8-bit I/O ports. Here, the upper eight bits 
of the port address will be 0. 

This instruction is only useful for accessing I/O ports located in the processor's I/O address space. See Chapter 18,
"Input/Output," in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1, for more
information on accessing I/O ports in the I/O address space.

This instruction's operation is the same in non-64-bit modes and 64-bit mode. 

IA-32 Architecture Compatibility 

After executing an OUT instruction, the Pentium® processor ensures that the EWBE# pin has been sampled active 
before it begins to execute the next instruction. (Note that the instruction can be prefetched if EWBE# is not active, 
but it will not be executed until the EWBE# pin is sampled active.) Only the Pentium processor family has the 
EWBE# pin.

Operation

IF ((PE = 1) and ((CPL > IOPL) or (VM = 1)))
    THEN (* Protected mode with CPL > IOPL or virtual-8086 mode *)
        IF (Any I/O Permission Bit for I/O port being accessed = 1)
            THEN (* I/O operation is not allowed *)
                #GP(0);
            ELSE (* I/O operation is allowed *)
                DEST ← SRC; (* Writes to selected I/O port *)
        FI;
    ELSE (* Real Mode or Protected Mode with CPL ≤ IOPL *)
        DEST ← SRC; (* Writes to selected I/O port *)
FI;
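
The privilege check in this pseudocode can be sketched in C as follows. The function `out_allowed` and its parameters are hypothetical; in particular, `io_bitmap` is assumed to be a flat 8192-byte copy of the TSS I/O permission bit map (one bit per port), and a port range that would run past 65,535 is simply treated as denied, which is a simplification made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the write may proceed, false where the pseudocode
   raises #GP(0). 'io_bitmap' is a hypothetical 8192-byte copy of the
   TSS I/O permission bit map; 'width' is the port size in bytes. */
static bool out_allowed(bool pe, bool vm, unsigned cpl, unsigned iopl,
                        const uint8_t *io_bitmap, uint16_t port,
                        unsigned width)
{
    if (pe && (cpl > iopl || vm)) {
        for (unsigned i = 0; i < width; i++) {
            uint32_t p = (uint32_t)port + i;
            if (p > 0xFFFF)
                return false;   /* assumption: ports past 65,535 are denied */
            if (io_bitmap[p / 8] & (1u << (p % 8)))
                return false;   /* any set permission bit denies access */
        }
    }
    return true;   /* real mode, or CPL <= IOPL outside virtual-8086 mode */
}
```

A 16-bit access checks two consecutive permission bits, so a word write to port 5FH is denied when only the bit for port 60H is set.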

Flags Affected 

None 

Protected Mode Exceptions

#GP(0)             If the CPL is greater than (has less privilege) the I/O privilege level (IOPL) and any of the
                   corresponding I/O permission bits in TSS for the I/O port being accessed is 1.
#UD                If the LOCK prefix is used.

Real-Address Mode Exceptions

#UD                If the LOCK prefix is used.

Virtual-8086 Mode Exceptions

#GP(0)             If any of the I/O permission bits in the TSS for the I/O port being accessed is 1.
#PF(fault-code)    If a page fault occurs.
#UD                If the LOCK prefix is used.

Compatibility Mode Exceptions

Same as protected mode exceptions.

64-Bit Mode Exceptions

Same as protected mode exceptions.

OUTS/OUTSB/OUTSW/OUTSD-Output String to Port


Opcode*   Instruction     Op/En   64-Bit Mode   Compat/Leg Mode   Description
6E        OUTS DX, m8     NP      Valid         Valid             Output byte from memory location specified in DS:(E)SI or RSI to I/O port specified in DX**.
6F        OUTS DX, m16    NP      Valid         Valid             Output word from memory location specified in DS:(E)SI or RSI to I/O port specified in DX**.
6F        OUTS DX, m32    NP      Valid         Valid             Output doubleword from memory location specified in DS:(E)SI or RSI to I/O port specified in DX**.
6E        OUTSB           NP      Valid         Valid             Output byte from memory location specified in DS:(E)SI or RSI to I/O port specified in DX**.
6F        OUTSW           NP      Valid         Valid             Output word from memory location specified in DS:(E)SI or RSI to I/O port specified in DX**.
6F        OUTSD           NP      Valid         Valid             Output doubleword from memory location specified in DS:(E)SI or RSI to I/O port specified in DX**.


NOTES: 

* See IA-32 Architecture Compatibility section below. 

** In 64-bit mode, only 64-bit (RSI) and 32-bit (ESI) address sizes are supported. In non-64-bit mode, only 32-bit (ESI) and 16-bit (SI)
address sizes are supported.


Instruction Operand Encoding 


Op/En   Operand 1   Operand 2   Operand 3   Operand 4
NP      NA          NA          NA          NA


Description 

Copies data from the source operand (second operand) to the I/O port specified with the destination operand (first 
operand). The source operand is a memory location, the address of which is read from either the DS:SI, DS:ESI or 
the RSI registers (depending on the address-size attribute of the instruction, 16, 32 or 64, respectively). (The DS 
segment may be overridden with a segment override prefix.) The destination operand is an I/O port address (from 
0 to 65,535) that is read from the DX register. The size of the I/O port being accessed (that is, the size of the source 
and destination operands) is determined by the opcode for an 8-bit I/O port or by the operand-size attribute of the 
instruction for a 16- or 32-bit I/O port. 

At the assembly-code level, two forms of this instruction are allowed: the "explicit-operands" form and the
"no-operands" form. The explicit-operands form (specified with the OUTS mnemonic) allows the source and destination
operands to be specified explicitly. Here, the source operand should be a symbol that indicates the size of the I/O 
port and the source address, and the destination operand must be DX. This explicit-operands form is provided to 
allow documentation; however, note that the documentation provided by this form can be misleading. That is, the 
source operand symbol must specify the correct type (size) of the operand (byte, word, or doubleword), but it does 
not have to specify the correct location. The location is always specified by the DS:(E)SI or RSI registers, which 
must be loaded correctly before the OUTS instruction is executed. 

The no-operands form provides "short forms" of the byte, word, and doubleword versions of the OUTS instructions. 
Here also DS:(E)SI is assumed to be the source operand and DX is assumed to be the destination operand. The size 
of the I/O port is specified with the choice of mnemonic: OUTSB (byte), OUTSW (word), or OUTSD (doubleword). 

After the byte, word, or doubleword is transferred from the memory location to the I/O port, the SI/ESI/RSI 
register is incremented or decremented automatically according to the setting of the DF flag in the EFLAGS register. 
(If the DF flag is 0, the SI/ESI/RSI register is incremented; if the DF flag is 1, the SI/ESI/RSI register is decremented.)
The SI/ESI/RSI register is incremented or decremented by 1 for byte operations, by 2 for word operations, and by 
4 for doubleword operations.

The OUTS, OUTSB, OUTSW, and OUTSD instructions can be preceded by the REP prefix for block output of ECX
bytes, words, or doublewords. See "REP/REPE/REPZ/REPNE/REPNZ—Repeat String Operation Prefix" in this
chapter for a description of the REP prefix. This instruction is only useful for accessing I/O ports located in the
processor's I/O address space. See Chapter 18, "Input/Output," in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 1, for more information on accessing I/O ports in the I/O address space.
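
The combined effect of the REP prefix and the automatic index update can be sketched with a small C model. Everything named here is hypothetical: `port_log` stands in for the I/O port by recording the values written, `mem` is a flat byte array standing in for the source segment, and `si` and `cx` play the roles of (E)SI and (E)CX. The sketch performs no bounds, privilege, or fault checking.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sink standing in for the I/O port: records each value written. */
struct port_log {
    uint32_t values[16];
    size_t count;
};

/* Model of REP OUTS over a flat byte array 'mem'.
   'size' is 1, 2, or 4 bytes per transfer. Values are assembled in
   little-endian order, matching x86 memory layout. */
static void rep_outs_model(const uint8_t *mem, uint32_t *si, uint32_t *cx,
                           unsigned size, bool df, struct port_log *log)
{
    while (*cx != 0) {
        uint32_t v = 0;
        for (unsigned b = 0; b < size; b++)
            v |= (uint32_t)mem[*si + b] << (8 * b);
        if (log->count < 16)
            log->values[log->count++] = v;      /* "write" to the port */
        *si = df ? *si - size : *si + size;     /* DF selects the direction */
        (*cx)--;                                /* REP counts down to zero */
    }
}
```

With DF = 0, `size` = 2 and a count of 2, the model emits two little-endian words and leaves the index advanced by 4; with DF = 1 the index moves downward by `size` after each transfer.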

In 64-bit mode, the default operand size is 32 bits; operand size is not promoted by the use of REX.W. In 64-bit 
mode, the default address size is 64 bits, and 64-bit address is specified using RSI by default. 32-bit address using 
ESI is supported using the prefix 67H, but 16-bit address is not supported in 64-bit mode.

IA-32 Architecture Compatibility 

After executing an OUTS, OUTSB, OUTSW, or OUTSD instruction, the Pentium processor ensures that the EWBE# 
pin has been sampled active before it begins to execute the next instruction. (Note that the instruction can be 
prefetched if EWBE# is not active, but it will not be executed until the EWBE# pin is sampled active.) Only the 
Pentium processor family has the EWBE# pin. 

For the Pentium 4, Intel® Xeon®, and P6 processor family, upon execution of an OUTS, OUTSB, OUTSW, or OUTSD 
instruction, the processor will not execute the next instruction until the data phase of the transaction is complete. 

Operation

IF ((PE = 1) and ((CPL > IOPL) or (VM = 1)))
    THEN (* Protected mode with CPL > IOPL or virtual-8086 mode *)
        IF (Any I/O Permission Bit for I/O port being accessed = 1)
            THEN (* I/O operation is not allowed *)
                #GP(0);
            ELSE (* I/O operation is allowed *)
                DEST ← SRC; (* Writes to I/O port *)
        FI;
    ELSE (* Real Mode or Protected Mode or 64-Bit Mode with CPL ≤ IOPL *)
        DEST ← SRC; (* Writes to I/O port *)
FI;


Byte transfer:

IF 64-bit mode
    THEN
        IF 64-Bit Address Size
            THEN
                IF DF = 0
                    THEN RSI ← RSI + 1;
                    ELSE RSI ← RSI - 1;
                FI;
            ELSE (* 32-Bit Address Size *)
                IF DF = 0
                    THEN ESI ← ESI + 1;
                    ELSE ESI ← ESI - 1;
                FI;
        FI;
    ELSE
        IF DF = 0
            THEN (E)SI ← (E)SI + 1;
            ELSE (E)SI ← (E)SI - 1;
        FI;
FI;

Word transfer:

IF 64-bit mode
    THEN
        IF 64-Bit Address Size
            THEN
                IF DF = 0
                    THEN RSI ← RSI + 2;
                    ELSE RSI ← RSI - 2;
                FI;
            ELSE (* 32-Bit Address Size *)
                IF DF = 0
                    THEN ESI ← ESI + 2;
                    ELSE ESI ← ESI - 2;
                FI;
        FI;
    ELSE
        IF DF = 0
            THEN (E)SI ← (E)SI + 2;
            ELSE (E)SI ← (E)SI - 2;
        FI;
FI;

Doubleword transfer:

IF 64-bit mode
    THEN
        IF 64-Bit Address Size
            THEN
                IF DF = 0
                    THEN RSI ← RSI + 4;
                    ELSE RSI ← RSI - 4;
                FI;
            ELSE (* 32-Bit Address Size *)
                IF DF = 0
                    THEN ESI ← ESI + 4;
                    ELSE ESI ← ESI - 4;
                FI;
        FI;
    ELSE
        IF DF = 0
            THEN (E)SI ← (E)SI + 4;
            ELSE (E)SI ← (E)SI - 4;
        FI;
FI;

Flags Affected

None

Protected Mode Exceptions

#GP(0)             If the CPL is greater than (has less privilege) the I/O privilege level (IOPL) and any of the
                   corresponding I/O permission bits in TSS for the I/O port being accessed is 1.
                   If a memory operand effective address is outside the limit of the CS, DS, ES, FS, or GS
                   segment.
                   If the segment register contains a NULL segment selector.
#PF(fault-code)    If a page fault occurs.
#AC(0)             If alignment checking is enabled and an unaligned memory reference is made while the
                   current privilege level is 3.
#UD                If the LOCK prefix is used.

Real-Address Mode Exceptions

#GP                If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS                If a memory operand effective address is outside the SS segment limit.
#UD                If the LOCK prefix is used.

Virtual-8086 Mode Exceptions

#GP(0)             If any of the I/O permission bits in the TSS for the I/O port being accessed is 1.
#PF(fault-code)    If a page fault occurs.
#AC(0)             If alignment checking is enabled and an unaligned memory reference is made.
#UD                If the LOCK prefix is used.

Compatibility Mode Exceptions

Same as for protected mode exceptions.

64-Bit Mode Exceptions

#SS(0)             If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)             If the CPL is greater than (has less privilege) the I/O privilege level (IOPL) and any of the
                   corresponding I/O permission bits in TSS for the I/O port being accessed is 1.
                   If the memory address is in a non-canonical form.
#PF(fault-code)    If a page fault occurs.
#AC(0)             If alignment checking is enabled and an unaligned memory reference is made while the
                   current privilege level is 3.
#UD                If the LOCK prefix is used.

PABSB/PABSW/PABSD/PABSQ - Packed Absolute Value


Opcode/ 

Instruction 

Op/ 

En 

64/32 bit 

Mode 

Support 

CPUID 

Feature 

Flag 

Description 

0F 38 1C /r¹

PABSB mm1, mm2/m64

RM

V/V

SSSE3

Compute the absolute value of bytes in
mm2/m64 and store UNSIGNED result in mm1.

66 0F 38 1C /r

PABSB xmm1, xmm2/m128

RM

V/V

SSSE3

Compute the absolute value of bytes in
xmm2/m128 and store UNSIGNED result in
xmm1.

0F 38 1D /r¹

PABSW mm1, mm2/m64

RM

V/V

SSSE3

Compute the absolute value of 16-bit integers
in mm2/m64 and store UNSIGNED result in
mm1.

66 0F 38 1D /r

PABSW xmm1, xmm2/m128

RM

V/V

SSSE3

Compute the absolute value of 16-bit integers
in xmm2/m128 and store UNSIGNED result in
xmm1.

0F 38 1E /r¹

PABSD mm1, mm2/m64

RM

V/V

SSSE3

Compute the absolute value of 32-bit integers
in mm2/m64 and store UNSIGNED result in
mm1.

66 0F 38 1E /r

PABSD xmm1, xmm2/m128

RM

V/V

SSSE3

Compute the absolute value of 32-bit integers
in xmm2/m128 and store UNSIGNED result in
xmm1.

VEX.128.66.0F38.WIG 1C /r

VPABSB xmm1, xmm2/m128

RM

V/V

AVX

Compute the absolute value of bytes in
xmm2/m128 and store UNSIGNED result in
xmm1.

VEX.128.66.0F38.WIG 1D /r

VPABSW xmm1, xmm2/m128

RM

V/V

AVX

Compute the absolute value of 16-bit
integers in xmm2/m128 and store UNSIGNED
result in xmm1.

VEX.128.66.0F38.WIG 1E /r

VPABSD xmm1, xmm2/m128

RM

V/V

AVX

Compute the absolute value of 32-bit
integers in xmm2/m128 and store UNSIGNED
result in xmm1.

VEX.256.66.0F38.WIG 1C /r

VPABSB ymm1, ymm2/m256

RM

V/V

AVX2

Compute the absolute value of bytes in
ymm2/m256 and store UNSIGNED result in
ymm1.

VEX.256.66.0F38.WIG 1D /r

VPABSW ymm1, ymm2/m256

RM

V/V

AVX2

Compute the absolute value of 16-bit integers
in ymm2/m256 and store UNSIGNED result in
ymm1.

VEX.256.66.0F38.WIG 1E /r

VPABSD ymm1, ymm2/m256

RM

V/V

AVX2

Compute the absolute value of 32-bit integers
in ymm2/m256 and store UNSIGNED result in
ymm1.

EVEX.128.66.0F38.WIG 1C /r

VPABSB xmm1 {k1}{z}, xmm2/m128

FVM

V/V

AVX512VL

AVX512BW

Compute the absolute value of bytes in
xmm2/m128 and store UNSIGNED result in
xmm1 using writemask k1.

EVEX.256.66.0F38.WIG 1C /r

VPABSB ymm1 {k1}{z}, ymm2/m256

FVM

V/V

AVX512VL

AVX512BW

Compute the absolute value of bytes in
ymm2/m256 and store UNSIGNED result in
ymm1 using writemask k1.

EVEX.512.66.0F38.WIG 1C /r

VPABSB zmm1 {k1}{z}, zmm2/m512

FVM

V/V

AVX512BW

Compute the absolute value of bytes in
zmm2/m512 and store UNSIGNED result in
zmm1 using writemask k1.

EVEX.128.66.0F38.WIG 1D /r

VPABSW xmm1 {k1}{z}, xmm2/m128

FVM

V/V

AVX512VL

AVX512BW

Compute the absolute value of 16-bit integers
in xmm2/m128 and store UNSIGNED result in
xmm1 using writemask k1.

EVEX.256.66.0F38.WIG 1D /r

VPABSW ymm1 {k1}{z}, ymm2/m256

FVM

V/V

AVX512VL

AVX512BW

Compute the absolute value of 16-bit integers
in ymm2/m256 and store UNSIGNED result in
ymm1 using writemask k1.

EVEX.512.66.0F38.WIG 1D /r

VPABSW zmm1 {k1}{z}, zmm2/m512

FVM

V/V

AVX512BW

Compute the absolute value of 16-bit integers
in zmm2/m512 and store UNSIGNED result in
zmm1 using writemask k1.

EVEX.128.66.0F38.W0 1E /r

VPABSD xmm1 {k1}{z}, xmm2/m128/m32bcst

FV

V/V

AVX512VL

AVX512F

Compute the absolute value of 32-bit integers
in xmm2/m128/m32bcst and store UNSIGNED
result in xmm1 using writemask k1.

EVEX.256.66.0F38.W0 1E /r

VPABSD ymm1 {k1}{z}, ymm2/m256/m32bcst

FV

V/V

AVX512VL

AVX512F

Compute the absolute value of 32-bit integers
in ymm2/m256/m32bcst and store UNSIGNED
result in ymm1 using writemask k1.

EVEX.512.66.0F38.W0 1E /r

VPABSD zmm1 {k1}{z}, zmm2/m512/m32bcst

FV

V/V

AVX512F

Compute the absolute value of 32-bit integers
in zmm2/m512/m32bcst and store UNSIGNED
result in zmm1 using writemask k1.

EVEX.128.66.0F38.W1 1F /r

VPABSQ xmm1 {k1}{z}, xmm2/m128/m64bcst

FV

V/V

AVX512VL

AVX512F

Compute the absolute value of 64-bit integers
in xmm2/m128/m64bcst and store UNSIGNED
result in xmm1 using writemask k1.

EVEX.256.66.0F38.W1 1F /r

VPABSQ ymm1 {k1}{z}, ymm2/m256/m64bcst

FV

V/V

AVX512VL

AVX512F

Compute the absolute value of 64-bit integers
in ymm2/m256/m64bcst and store UNSIGNED
result in ymm1 using writemask k1.

EVEX.512.66.0F38.W1 1F /r

VPABSQ zmm1 {k1}{z}, zmm2/m512/m64bcst

FV

V/V

AVX512F

Compute the absolute value of 64-bit integers
in zmm2/m512/m64bcst and store UNSIGNED
result in zmm1 using writemask k1.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers"
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding

Op/En   Operand 1       Operand 2       Operand 3   Operand 4
RM      ModRM:reg (w)   ModRM:r/m (r)   NA          NA
FVM     ModRM:reg (w)   ModRM:r/m (r)   NA          NA
FV      ModRM:reg (w)   ModRM:r/m (r)   NA          NA


Description 

PABSB/W/D computes the absolute value of each data element of the source operand (the second operand) and 
stores the UNSIGNED results in the destination operand (the first operand). PABSB operates on signed bytes, 
PABSW operates on signed 16-bit words, and PABSD operates on signed 32-bit integers. 

EVEX encoded VPABSD/Q: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, 
or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. The destination operand is a 
ZMM/YMM/XMM register updated according to the writemask. 

EVEX encoded VPABSB/W: The source operand is a ZMM/YMM/XMM register, or a 512/256/128-bit memory loca¬ 
tion. The destination operand is a ZMM/YMM/XMM register updated according to the writemask. 

VEX.256 encoded versions: The source operand is a YMM register or a 256-bit memory location. The destination 
operand is a YMM register. The upper bits (MAX_VL-1:256) of the corresponding register destination are zeroed. 

VEX.128 encoded versions: The source operand is an XMM register or 128-bit memory location. The destination
operand is an XMM register. The upper bits (MAX_VL-1:128) of the corresponding register destination are zeroed.

128-bit Legacy SSE version: The source operand can be an XMM register or a 128-bit memory location. The
destination is an XMM register. The upper bits (MAX_VL-1:128) of the corresponding register destination are
unmodified.

VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise, instructions will #UD.
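
The reason the result is described as UNSIGNED shows up at the most negative input: the absolute value of the signed byte -128 is 128, which fits in an unsigned byte (80H) but not in a signed one. A minimal C sketch of the byte variant, with hypothetical function names, makes this visible:

```c
#include <stdint.h>

/* Absolute value of one signed byte, returned as an unsigned byte. */
static uint8_t pabsb_lane(int8_t x)
{
    int v = x;                       /* widen so that -(-128) is representable */
    return (uint8_t)(v < 0 ? -v : v);
}

/* Packed form over n byte lanes: dest[i] = |src[i]| as an unsigned value. */
static void pabsb_model(uint8_t *dest, const int8_t *src, int n)
{
    for (int i = 0; i < n; i++)
        dest[i] = pabsb_lane(src[i]);
}
```

For the inputs {-1, 5, -128, 0} the model produces {1, 5, 128, 0}. The word and doubleword variants follow the same pattern with wider element types.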

Operation 

PABSB with 128 bit operands:
    Unsigned DEST[7:0] ← ABS(SRC[7:0])
    Repeat operation for 2nd through 15th bytes
    Unsigned DEST[127:120] ← ABS(SRC[127:120])

VPABSB with 128 bit operands:
    Unsigned DEST[7:0] ← ABS(SRC[7:0])
    Repeat operation for 2nd through 15th bytes
    Unsigned DEST[127:120] ← ABS(SRC[127:120])

VPABSB with 256 bit operands:
    Unsigned DEST[7:0] ← ABS(SRC[7:0])
    Repeat operation for 2nd through 31st bytes
    Unsigned DEST[255:248] ← ABS(SRC[255:248])

VPABSB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask*
        THEN
            Unsigned DEST[i+7:i] ← ABS(SRC[i+7:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+7:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+7:i] ← 0
            FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0
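
The writemask loop above can be sketched in scalar form as follows. This is an illustrative model only: the function name `vpabsb_masked`, the list-of-integers lane representation, and the `zeroing` flag (standing in for the {z} modifier) are assumptions made for the example.

```python
def vpabsb_masked(dest, src, k1, zeroing=False):
    """Model the per-lane writemask behavior of EVEX-encoded VPABSB.

    dest and src are equal-length lists of byte lanes; bit j of k1 selects
    lane j. Lanes whose mask bit is clear are kept (merging-masking) or
    cleared (zeroing-masking).
    """
    result = list(dest)
    for j, s in enumerate(src):
        if (k1 >> j) & 1:
            result[j] = abs(s) & 0xFF   # unsigned result, -128 becomes 80H
        elif zeroing:
            result[j] = 0
        # merging-masking: result[j] keeps the old destination value
    return result
```

With `dest = [9, 9, 9, 9]`, `src = [-1, -2, -3, -128]` and `k1 = 0b1010`, merging-masking leaves lanes 0 and 2 at 9, while zeroing-masking clears them.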

PABSW with 128 bit operands:
    Unsigned DEST[15:0] ← ABS(SRC[15:0])
    Repeat operation for 2nd through 7th 16-bit words
    Unsigned DEST[127:112] ← ABS(SRC[127:112])

VPABSW with 128 bit operands:
    Unsigned DEST[15:0] ← ABS(SRC[15:0])
    Repeat operation for 2nd through 7th 16-bit words
    Unsigned DEST[127:112] ← ABS(SRC[127:112])

VPABSW with 256 bit operands:
    Unsigned DEST[15:0] ← ABS(SRC[15:0])
    Repeat operation for 2nd through 15th 16-bit words
    Unsigned DEST[255:240] ← ABS(SRC[255:240])

VPABSW (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN
            Unsigned DEST[i+15:i] ← ABS(SRC[i+15:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+15:i] ← 0
            FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0

PABSD with 128 bit operands:
    Unsigned DEST[31:0] ← ABS(SRC[31:0])
    Repeat operation for 2nd through 3rd 32-bit double words
    Unsigned DEST[127:96] ← ABS(SRC[127:96])

VPABSD with 128 bit operands:
    Unsigned DEST[31:0] ← ABS(SRC[31:0])
    Repeat operation for 2nd through 3rd 32-bit double words
    Unsigned DEST[127:96] ← ABS(SRC[127:96])

VPABSD with 256 bit operands:
    Unsigned DEST[31:0] ← ABS(SRC[31:0])
    Repeat operation for 2nd through 7th 32-bit double words
    Unsigned DEST[255:224] ← ABS(SRC[255:224])

VPABSD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC *is memory*)
                THEN
                    Unsigned DEST[i+31:i] ← ABS(SRC[31:0])
                ELSE
                    Unsigned DEST[i+31:i] ← ABS(SRC[i+31:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0

VPABSQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC *is memory*)
                THEN
                    Unsigned DEST[i+63:i] ← ABS(SRC[63:0])
                ELSE
                    Unsigned DEST[i+63:i] ← ABS(SRC[i+63:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0
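
The EVEX.b branch in the VPABSD and VPABSQ loops describes embedded broadcast: when the source is memory and EVEX.b = 1, every lane reads the same single element. A rough scalar sketch follows, with writemasking left out for brevity; the name `vpabs_evex` and its parameters are hypothetical.

```python
def vpabs_evex(src, kl, width, broadcast=False):
    """Model VPABSD (width=32) or VPABSQ (width=64) without a writemask.

    src is a list of signed lane values. With broadcast=True it models a
    memory source with EVEX.b = 1, so only src[0] is read and it is
    replicated to all kl lanes.
    """
    mask = (1 << width) - 1
    lanes = [src[0]] * kl if broadcast else list(src[:kl])
    # Mask so that the most negative input keeps its unsigned bit pattern.
    return [abs(v) & mask for v in lanes]
```

Calling `vpabs_evex([-5], 4, 32, broadcast=True)` models a 128-bit VPABSD with an m32bcst source and yields the same magnitude in all four lanes.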

Intel C/C++ Compiler Intrinsic Equivalents 

VPABSB __m512i _mm512_abs_epi8 (__m512i a)
VPABSW __m512i _mm512_abs_epi16 (__m512i a)
VPABSB __m512i _mm512_mask_abs_epi8 (__m512i s, __mmask64 m, __m512i a)
VPABSW __m512i _mm512_mask_abs_epi16 (__m512i s, __mmask32 m, __m512i a)
VPABSB __m512i _mm512_maskz_abs_epi8 (__mmask64 m, __m512i a)
VPABSW __m512i _mm512_maskz_abs_epi16 (__mmask32 m, __m512i a)
VPABSB __m256i _mm256_mask_abs_epi8 (__m256i s, __mmask32 m, __m256i a)
VPABSW __m256i _mm256_mask_abs_epi16 (__m256i s, __mmask16 m, __m256i a)
VPABSB __m256i _mm256_maskz_abs_epi8 (__mmask32 m, __m256i a)
VPABSW __m256i _mm256_maskz_abs_epi16 (__mmask16 m, __m256i a)
VPABSB __m128i _mm_mask_abs_epi8 (__m128i s, __mmask16 m, __m128i a)
VPABSW __m128i _mm_mask_abs_epi16 (__m128i s, __mmask8 m, __m128i a)
VPABSB __m128i _mm_maskz_abs_epi8 (__mmask16 m, __m128i a)
VPABSW __m128i _mm_maskz_abs_epi16 (__mmask8 m, __m128i a)
VPABSD __m256i _mm256_mask_abs_epi32(__m256i s, __mmask8 k, __m256i a);
VPABSD __m256i _mm256_maskz_abs_epi32(__mmask8 k, __m256i a);
VPABSD __m128i _mm_mask_abs_epi32(__m128i s, __mmask8 k, __m128i a);
VPABSD __m128i _mm_maskz_abs_epi32(__mmask8 k, __m128i a);
VPABSD __m512i _mm512_abs_epi32(__m512i a);
VPABSD __m512i _mm512_mask_abs_epi32(__m512i s, __mmask16 k, __m512i a);
VPABSD __m512i _mm512_maskz_abs_epi32(__mmask16 k, __m512i a);
VPABSQ __m512i _mm512_abs_epi64(__m512i a);
VPABSQ __m512i _mm512_mask_abs_epi64(__m512i s, __mmask8 k, __m512i a);
VPABSQ __m512i _mm512_maskz_abs_epi64(__mmask8 k, __m512i a);
VPABSQ __m256i _mm256_mask_abs_epi64(__m256i s, __mmask8 k, __m256i a);
VPABSQ __m256i _mm256_maskz_abs_epi64(__mmask8 k, __m256i a);
VPABSQ __m128i _mm_mask_abs_epi64(__m128i s, __mmask8 k, __m128i a);
VPABSQ __m128i _mm_maskz_abs_epi64(__mmask8 k, __m128i a);
PABSB __m128i _mm_abs_epi8 (__m128i a)
VPABSB __m128i _mm_abs_epi8 (__m128i a)

VPABSB __m256i _mm256_abs_epi8 (__m256i a)
PABSW __m128i _mm_abs_epi16 (__m128i a)
VPABSW __m128i _mm_abs_epi16 (__m128i a)
VPABSW __m256i _mm256_abs_epi16 (__m256i a)
PABSD __m128i _mm_abs_epi32 (__m128i a)
VPABSD __m128i _mm_abs_epi32 (__m128i a)
VPABSD __m256i _mm256_abs_epi32 (__m256i a)

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 
EVEX-encoded VPABSD/Q, see Exceptions Type E4. 
EVEX-encoded VPABSB/W, see Exceptions Type E4.nb.

PACKSSWB/PACKSSDW - Pack with Signed Saturation


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F 63 /r¹ PACKSSWB mm1, mm2/m64 | RM | V/V | MMX | Converts 4 packed signed word integers from mm1 and from mm2/m64 into 8 packed signed byte integers in mm1 using signed saturation.
66 0F 63 /r PACKSSWB xmm1, xmm2/m128 | RM | V/V | SSE2 | Converts 8 packed signed word integers from xmm1 and from xmm2/m128 into 16 packed signed byte integers in xmm1 using signed saturation.
0F 6B /r¹ PACKSSDW mm1, mm2/m64 | RM | V/V | MMX | Converts 2 packed signed doubleword integers from mm1 and from mm2/m64 into 4 packed signed word integers in mm1 using signed saturation.
66 0F 6B /r PACKSSDW xmm1, xmm2/m128 | RM | V/V | SSE2 | Converts 4 packed signed doubleword integers from xmm1 and from xmm2/m128 into 8 packed signed word integers in xmm1 using signed saturation.
VEX.NDS.128.66.0F.WIG 63 /r VPACKSSWB xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Converts 8 packed signed word integers from xmm2 and from xmm3/m128 into 16 packed signed byte integers in xmm1 using signed saturation.
VEX.NDS.128.66.0F.WIG 6B /r VPACKSSDW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Converts 4 packed signed doubleword integers from xmm2 and from xmm3/m128 into 8 packed signed word integers in xmm1 using signed saturation.
VEX.NDS.256.66.0F.WIG 63 /r VPACKSSWB ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Converts 16 packed signed word integers from ymm2 and from ymm3/m256 into 32 packed signed byte integers in ymm1 using signed saturation.
VEX.NDS.256.66.0F.WIG 6B /r VPACKSSDW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Converts 8 packed signed doubleword integers from ymm2 and from ymm3/m256 into 16 packed signed word integers in ymm1 using signed saturation.
EVEX.NDS.128.66.0F.WIG 63 /r VPACKSSWB xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Converts packed signed word integers from xmm2 and from xmm3/m128 into packed signed byte integers in xmm1 using signed saturation under writemask k1.
EVEX.NDS.256.66.0F.WIG 63 /r VPACKSSWB ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Converts packed signed word integers from ymm2 and from ymm3/m256 into packed signed byte integers in ymm1 using signed saturation under writemask k1.
EVEX.NDS.512.66.0F.WIG 63 /r VPACKSSWB zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Converts packed signed word integers from zmm2 and from zmm3/m512 into packed signed byte integers in zmm1 using signed saturation under writemask k1.
EVEX.NDS.128.66.0F.W0 6B /r VPACKSSDW xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512BW | Converts packed signed doubleword integers from xmm2 and from xmm3/m128/m32bcst into packed signed word integers in xmm1 using signed saturation under writemask k1.
EVEX.NDS.256.66.0F.W0 6B /r VPACKSSDW ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512BW | Converts packed signed doubleword integers from ymm2 and from ymm3/m256/m32bcst into packed signed word integers in ymm1 using signed saturation under writemask k1.
EVEX.NDS.512.66.0F.W0 6B /r VPACKSSDW zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | FV | V/V | AVX512BW | Converts packed signed doubleword integers from zmm2 and from zmm3/m512/m32bcst into packed signed word integers in zmm1 using signed saturation under writemask k1.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers"
in the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Converts packed signed word integers into packed signed byte integers (PACKSSWB) or converts packed signed
doubleword integers into packed signed word integers (PACKSSDW), using saturation to handle overflow
conditions. See Figure 4-6 for an example of the packing operation.



Figure 4-6. Operation of the PACKSSDW Instruction Using 64-bit Operands 


PACKSSWB converts packed signed word integers in the first and second source operands into packed signed byte
integers using signed saturation to handle overflow conditions beyond the range of signed byte integers. If the
signed word value is beyond the range of a signed byte (that is, greater than 7FH or less than 80H), the
saturated signed byte integer value of 7FH or 80H, respectively, is stored in the destination. PACKSSDW converts
packed signed doubleword integers in the first and second source operands into packed signed word integers using
signed saturation to handle overflow conditions beyond 7FFFH and 8000H.
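
Signed saturation as described above is a clamp to the destination range. A minimal sketch, assuming values are plain Python integers (the helper name `saturate_signed` is illustrative, not an Intel-defined function):

```python
def saturate_signed(value, bits):
    """Clamp a signed integer to the range of a signed `bits`-wide integer."""
    hi = (1 << (bits - 1)) - 1      # 7FH for bits=8, 7FFFH for bits=16
    lo = -hi - 1                    # 80H for bits=8, 8000H for bits=16
    return max(lo, min(hi, value))
```

Here `saturate_signed(w, 8)` plays the role of the word-to-byte conversion in PACKSSWB and `saturate_signed(d, 16)` the doubleword-to-word conversion in PACKSSDW; in-range values pass through unchanged.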

EVEX encoded PACKSSWB: The first source operand is a ZMM/YMM/XMM register. The second source operand is a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM
register, updated conditionally under the writemask k1.

EVEX encoded PACKSSDW: The first source operand is a ZMM/YMM/XMM register. The second source operand is a
ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a
32-bit memory location. The destination operand is a ZMM/YMM/XMM register, updated conditionally under the
writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAX_VL-1:256) of the
corresponding ZMM register destination are zeroed.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128)
of the corresponding ZMM register destination are zeroed.

128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM
register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the
upper bits (MAX_VL-1:128) of the corresponding ZMM destination register are unmodified.

Operation 

PACKSSWB instruction (128-bit Legacy SSE version)
    DEST[7:0] ← SaturateSignedWordToSignedByte (DEST[15:0]);
    DEST[15:8] ← SaturateSignedWordToSignedByte (DEST[31:16]);
    DEST[23:16] ← SaturateSignedWordToSignedByte (DEST[47:32]);
    DEST[31:24] ← SaturateSignedWordToSignedByte (DEST[63:48]);
    DEST[39:32] ← SaturateSignedWordToSignedByte (DEST[79:64]);
    DEST[47:40] ← SaturateSignedWordToSignedByte (DEST[95:80]);
    DEST[55:48] ← SaturateSignedWordToSignedByte (DEST[111:96]);
    DEST[63:56] ← SaturateSignedWordToSignedByte (DEST[127:112]);
    DEST[71:64] ← SaturateSignedWordToSignedByte (SRC[15:0]);
    DEST[79:72] ← SaturateSignedWordToSignedByte (SRC[31:16]);
    DEST[87:80] ← SaturateSignedWordToSignedByte (SRC[47:32]);
    DEST[95:88] ← SaturateSignedWordToSignedByte (SRC[63:48]);
    DEST[103:96] ← SaturateSignedWordToSignedByte (SRC[79:64]);
    DEST[111:104] ← SaturateSignedWordToSignedByte (SRC[95:80]);
    DEST[119:112] ← SaturateSignedWordToSignedByte (SRC[111:96]);
    DEST[127:120] ← SaturateSignedWordToSignedByte (SRC[127:112]);
    DEST[MAX_VL-1:128] (Unmodified)

PACKSSDW instruction (128-bit Legacy SSE version)
    DEST[15:0] ← SaturateSignedDwordToSignedWord (DEST[31:0]);
    DEST[31:16] ← SaturateSignedDwordToSignedWord (DEST[63:32]);
    DEST[47:32] ← SaturateSignedDwordToSignedWord (DEST[95:64]);
    DEST[63:48] ← SaturateSignedDwordToSignedWord (DEST[127:96]);
    DEST[79:64] ← SaturateSignedDwordToSignedWord (SRC[31:0]);
    DEST[95:80] ← SaturateSignedDwordToSignedWord (SRC[63:32]);
    DEST[111:96] ← SaturateSignedDwordToSignedWord (SRC[95:64]);
    DEST[127:112] ← SaturateSignedDwordToSignedWord (SRC[127:96]);
    DEST[MAX_VL-1:128] (Unmodified)

VPACKSSWB instruction (VEX.128 encoded version)

    DEST[7:0] ← SaturateSignedWordToSignedByte (SRC1[15:0]);
    DEST[15:8] ← SaturateSignedWordToSignedByte (SRC1[31:16]);
    DEST[23:16] ← SaturateSignedWordToSignedByte (SRC1[47:32]);
    DEST[31:24] ← SaturateSignedWordToSignedByte (SRC1[63:48]);
    DEST[39:32] ← SaturateSignedWordToSignedByte (SRC1[79:64]);
    DEST[47:40] ← SaturateSignedWordToSignedByte (SRC1[95:80]);
    DEST[55:48] ← SaturateSignedWordToSignedByte (SRC1[111:96]);
    DEST[63:56] ← SaturateSignedWordToSignedByte (SRC1[127:112]);
    DEST[71:64] ← SaturateSignedWordToSignedByte (SRC2[15:0]);
    DEST[79:72] ← SaturateSignedWordToSignedByte (SRC2[31:16]);
    DEST[87:80] ← SaturateSignedWordToSignedByte (SRC2[47:32]);
    DEST[95:88] ← SaturateSignedWordToSignedByte (SRC2[63:48]);
    DEST[103:96] ← SaturateSignedWordToSignedByte (SRC2[79:64]);
    DEST[111:104] ← SaturateSignedWordToSignedByte (SRC2[95:80]);
    DEST[119:112] ← SaturateSignedWordToSignedByte (SRC2[111:96]);
    DEST[127:120] ← SaturateSignedWordToSignedByte (SRC2[127:112]);
    DEST[MAX_VL-1:128] ← 0;

VPACKSSDW instruction (VEX.128 encoded version)
    DEST[15:0] ← SaturateSignedDwordToSignedWord (SRC1[31:0]);
    DEST[31:16] ← SaturateSignedDwordToSignedWord (SRC1[63:32]);
    DEST[47:32] ← SaturateSignedDwordToSignedWord (SRC1[95:64]);
    DEST[63:48] ← SaturateSignedDwordToSignedWord (SRC1[127:96]);
    DEST[79:64] ← SaturateSignedDwordToSignedWord (SRC2[31:0]);
    DEST[95:80] ← SaturateSignedDwordToSignedWord (SRC2[63:32]);
    DEST[111:96] ← SaturateSignedDwordToSignedWord (SRC2[95:64]);
    DEST[127:112] ← SaturateSignedDwordToSignedWord (SRC2[127:96]);
    DEST[MAX_VL-1:128] ← 0;

VPACKSSWB instruction (VEX.256 encoded version)
    DEST[7:0] ← SaturateSignedWordToSignedByte (SRC1[15:0]);
    DEST[15:8] ← SaturateSignedWordToSignedByte (SRC1[31:16]);
    DEST[23:16] ← SaturateSignedWordToSignedByte (SRC1[47:32]);
    DEST[31:24] ← SaturateSignedWordToSignedByte (SRC1[63:48]);
    DEST[39:32] ← SaturateSignedWordToSignedByte (SRC1[79:64]);
    DEST[47:40] ← SaturateSignedWordToSignedByte (SRC1[95:80]);
    DEST[55:48] ← SaturateSignedWordToSignedByte (SRC1[111:96]);
    DEST[63:56] ← SaturateSignedWordToSignedByte (SRC1[127:112]);
    DEST[71:64] ← SaturateSignedWordToSignedByte (SRC2[15:0]);
    DEST[79:72] ← SaturateSignedWordToSignedByte (SRC2[31:16]);
    DEST[87:80] ← SaturateSignedWordToSignedByte (SRC2[47:32]);
    DEST[95:88] ← SaturateSignedWordToSignedByte (SRC2[63:48]);
    DEST[103:96] ← SaturateSignedWordToSignedByte (SRC2[79:64]);
    DEST[111:104] ← SaturateSignedWordToSignedByte (SRC2[95:80]);
    DEST[119:112] ← SaturateSignedWordToSignedByte (SRC2[111:96]);
    DEST[127:120] ← SaturateSignedWordToSignedByte (SRC2[127:112]);
    DEST[135:128] ← SaturateSignedWordToSignedByte (SRC1[143:128]);
    DEST[143:136] ← SaturateSignedWordToSignedByte (SRC1[159:144]);
    DEST[151:144] ← SaturateSignedWordToSignedByte (SRC1[175:160]);
    DEST[159:152] ← SaturateSignedWordToSignedByte (SRC1[191:176]);
    DEST[167:160] ← SaturateSignedWordToSignedByte (SRC1[207:192]);
    DEST[175:168] ← SaturateSignedWordToSignedByte (SRC1[223:208]);
    DEST[183:176] ← SaturateSignedWordToSignedByte (SRC1[239:224]);
    DEST[191:184] ← SaturateSignedWordToSignedByte (SRC1[255:240]);
    DEST[199:192] ← SaturateSignedWordToSignedByte (SRC2[143:128]);
    DEST[207:200] ← SaturateSignedWordToSignedByte (SRC2[159:144]);
    DEST[215:208] ← SaturateSignedWordToSignedByte (SRC2[175:160]);
    DEST[223:216] ← SaturateSignedWordToSignedByte (SRC2[191:176]);
    DEST[231:224] ← SaturateSignedWordToSignedByte (SRC2[207:192]);
    DEST[239:232] ← SaturateSignedWordToSignedByte (SRC2[223:208]);
    DEST[247:240] ← SaturateSignedWordToSignedByte (SRC2[239:224]);
    DEST[255:248] ← SaturateSignedWordToSignedByte (SRC2[255:240]);
    DEST[MAX_VL-1:256] ← 0;

VPACKSSDW instruction (VEX.256 encoded version)
    DEST[15:0] ← SaturateSignedDwordToSignedWord (SRC1[31:0]);
    DEST[31:16] ← SaturateSignedDwordToSignedWord (SRC1[63:32]);
    DEST[47:32] ← SaturateSignedDwordToSignedWord (SRC1[95:64]);
    DEST[63:48] ← SaturateSignedDwordToSignedWord (SRC1[127:96]);
    DEST[79:64] ← SaturateSignedDwordToSignedWord (SRC2[31:0]);
    DEST[95:80] ← SaturateSignedDwordToSignedWord (SRC2[63:32]);
    DEST[111:96] ← SaturateSignedDwordToSignedWord (SRC2[95:64]);
    DEST[127:112] ← SaturateSignedDwordToSignedWord (SRC2[127:96]);
    DEST[143:128] ← SaturateSignedDwordToSignedWord (SRC1[159:128]);
    DEST[159:144] ← SaturateSignedDwordToSignedWord (SRC1[191:160]);
    DEST[175:160] ← SaturateSignedDwordToSignedWord (SRC1[223:192]);
    DEST[191:176] ← SaturateSignedDwordToSignedWord (SRC1[255:224]);
    DEST[207:192] ← SaturateSignedDwordToSignedWord (SRC2[159:128]);
    DEST[223:208] ← SaturateSignedDwordToSignedWord (SRC2[191:160]);
    DEST[239:224] ← SaturateSignedDwordToSignedWord (SRC2[223:192]);
    DEST[255:240] ← SaturateSignedDwordToSignedWord (SRC2[255:224]);
    DEST[MAX_VL-1:256] ← 0;
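
The 256-bit listings show that packing is done independently within each 128-bit lane: the low half of each destination lane comes from SRC1 and the high half from SRC2 of the same lane. A compact scalar sketch of that element ordering follows, assuming operands are given as lists of signed integers in ascending bit order; `pack_signed` and `saturate_signed` are illustrative names only.

```python
def saturate_signed(value, bits):
    """Clamp a signed integer to the signed `bits`-wide range."""
    hi = (1 << (bits - 1)) - 1
    return max(-hi - 1, min(hi, value))


def pack_signed(src1, src2, src_bits):
    """Model PACKSSWB (src_bits=16) or PACKSSDW (src_bits=32) element order.

    src1 and src2 hold the source elements of a 128-, 256- or 512-bit
    operand. Each 128-bit lane is packed on its own: SRC1 elements of the
    lane first, then SRC2 elements of the same lane.
    """
    per_lane = 128 // src_bits
    if len(src1) != len(src2) or len(src1) % per_lane:
        raise ValueError("operands must cover whole 128-bit lanes")
    dest = []
    for base in range(0, len(src1), per_lane):
        for operand in (src1, src2):
            dest.extend(saturate_signed(v, src_bits // 2)
                        for v in operand[base:base + per_lane])
    return dest
```

For a 256-bit VPACKSSDW with `src1 = [0..7]` and `src2 = [100..107]`, the result interleaves per lane as `[0, 1, 2, 3, 100, 101, 102, 103, 4, 5, 6, 7, 104, 105, 106, 107]`, not as all of SRC1 followed by all of SRC2.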

VPACKSSWB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
TMP_DEST[7:0] ← SaturateSignedWordToSignedByte (SRC1[15:0]);
TMP_DEST[15:8] ← SaturateSignedWordToSignedByte (SRC1[31:16]);
TMP_DEST[23:16] ← SaturateSignedWordToSignedByte (SRC1[47:32]);
TMP_DEST[31:24] ← SaturateSignedWordToSignedByte (SRC1[63:48]);
TMP_DEST[39:32] ← SaturateSignedWordToSignedByte (SRC1[79:64]);
TMP_DEST[47:40] ← SaturateSignedWordToSignedByte (SRC1[95:80]);
TMP_DEST[55:48] ← SaturateSignedWordToSignedByte (SRC1[111:96]);
TMP_DEST[63:56] ← SaturateSignedWordToSignedByte (SRC1[127:112]);
TMP_DEST[71:64] ← SaturateSignedWordToSignedByte (SRC2[15:0]);
TMP_DEST[79:72] ← SaturateSignedWordToSignedByte (SRC2[31:16]);
TMP_DEST[87:80] ← SaturateSignedWordToSignedByte (SRC2[47:32]);
TMP_DEST[95:88] ← SaturateSignedWordToSignedByte (SRC2[63:48]);
TMP_DEST[103:96] ← SaturateSignedWordToSignedByte (SRC2[79:64]);
TMP_DEST[111:104] ← SaturateSignedWordToSignedByte (SRC2[95:80]);
TMP_DEST[119:112] ← SaturateSignedWordToSignedByte (SRC2[111:96]);
TMP_DEST[127:120] ← SaturateSignedWordToSignedByte (SRC2[127:112]);
IF VL >= 256
    TMP_DEST[135:128] ← SaturateSignedWordToSignedByte (SRC1[143:128]);
    TMP_DEST[143:136] ← SaturateSignedWordToSignedByte (SRC1[159:144]);
    TMP_DEST[151:144] ← SaturateSignedWordToSignedByte (SRC1[175:160]);
    TMP_DEST[159:152] ← SaturateSignedWordToSignedByte (SRC1[191:176]);
    TMP_DEST[167:160] ← SaturateSignedWordToSignedByte (SRC1[207:192]);
    TMP_DEST[175:168] ← SaturateSignedWordToSignedByte (SRC1[223:208]);
    TMP_DEST[183:176] ← SaturateSignedWordToSignedByte (SRC1[239:224]);
    TMP_DEST[191:184] ← SaturateSignedWordToSignedByte (SRC1[255:240]);
    TMP_DEST[199:192] ← SaturateSignedWordToSignedByte (SRC2[143:128]);
    TMP_DEST[207:200] ← SaturateSignedWordToSignedByte (SRC2[159:144]);
    TMP_DEST[215:208] ← SaturateSignedWordToSignedByte (SRC2[175:160]);
    TMP_DEST[223:216] ← SaturateSignedWordToSignedByte (SRC2[191:176]);
    TMP_DEST[231:224] ← SaturateSignedWordToSignedByte (SRC2[207:192]);
    TMP_DEST[239:232] ← SaturateSignedWordToSignedByte (SRC2[223:208]);
    TMP_DEST[247:240] ← SaturateSignedWordToSignedByte (SRC2[239:224]);
    TMP_DEST[255:248] ← SaturateSignedWordToSignedByte (SRC2[255:240]);
FI;
IF VL >= 512
    TMP_DEST[263:256] ← SaturateSignedWordToSignedByte (SRC1[271:256]);
    TMP_DEST[271:264] ← SaturateSignedWordToSignedByte (SRC1[287:272]);
    TMP_DEST[279:272] ← SaturateSignedWordToSignedByte (SRC1[303:288]);
    TMP_DEST[287:280] ← SaturateSignedWordToSignedByte (SRC1[319:304]);
    TMP_DEST[295:288] ← SaturateSignedWordToSignedByte (SRC1[335:320]);
    TMP_DEST[303:296] ← SaturateSignedWordToSignedByte (SRC1[351:336]);
    TMP_DEST[311:304] ← SaturateSignedWordToSignedByte (SRC1[367:352]);
    TMP_DEST[319:312] ← SaturateSignedWordToSignedByte (SRC1[383:368]);
    TMP_DEST[327:320] ← SaturateSignedWordToSignedByte (SRC2[271:256]);
    TMP_DEST[335:328] ← SaturateSignedWordToSignedByte (SRC2[287:272]);
    TMP_DEST[343:336] ← SaturateSignedWordToSignedByte (SRC2[303:288]);
    TMP_DEST[351:344] ← SaturateSignedWordToSignedByte (SRC2[319:304]);
    TMP_DEST[359:352] ← SaturateSignedWordToSignedByte (SRC2[335:320]);
    TMP_DEST[367:360] ← SaturateSignedWordToSignedByte (SRC2[351:336]);
    TMP_DEST[375:368] ← SaturateSignedWordToSignedByte (SRC2[367:352]);
    TMP_DEST[383:376] ← SaturateSignedWordToSignedByte (SRC2[383:368]);
    TMP_DEST[391:384] ← SaturateSignedWordToSignedByte (SRC1[399:384]);
    TMP_DEST[399:392] ← SaturateSignedWordToSignedByte (SRC1[415:400]);
    TMP_DEST[407:400] ← SaturateSignedWordToSignedByte (SRC1[431:416]);
    TMP_DEST[415:408] ← SaturateSignedWordToSignedByte (SRC1[447:432]);
    TMP_DEST[423:416] ← SaturateSignedWordToSignedByte (SRC1[463:448]);
    TMP_DEST[431:424] ← SaturateSignedWordToSignedByte (SRC1[479:464]);
    TMP_DEST[439:432] ← SaturateSignedWordToSignedByte (SRC1[495:480]);
    TMP_DEST[447:440] ← SaturateSignedWordToSignedByte (SRC1[511:496]);
    TMP_DEST[455:448] ← SaturateSignedWordToSignedByte (SRC2[399:384]);
    TMP_DEST[463:456] ← SaturateSignedWordToSignedByte (SRC2[415:400]);
    TMP_DEST[471:464] ← SaturateSignedWordToSignedByte (SRC2[431:416]);
    TMP_DEST[479:472] ← SaturateSignedWordToSignedByte (SRC2[447:432]);
    TMP_DEST[487:480] ← SaturateSignedWordToSignedByte (SRC2[463:448]);
    TMP_DEST[495:488] ← SaturateSignedWordToSignedByte (SRC2[479:464]);
    TMP_DEST[503:496] ← SaturateSignedWordToSignedByte (SRC2[495:480]);
    TMP_DEST[511:504] ← SaturateSignedWordToSignedByte (SRC2[511:496]);
FI;
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask*
        THEN
            DEST[i+7:i] ← TMP_DEST[i+7:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+7:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+7:i] ← 0
            FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0

VPACKSSDW (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO ((KL/2) - 1)
    i ← j * 32
    IF (EVEX.b == 1) AND (SRC2 *is memory*)
        THEN
            TMP_SRC2[i+31:i] ← SRC2[31:0]
        ELSE
            TMP_SRC2[i+31:i] ← SRC2[i+31:i]
    FI;
ENDFOR;
TMP_DEST[15:0] ← SaturateSignedDwordToSignedWord (SRC1[31:0]);
TMP_DEST[31:16] ← SaturateSignedDwordToSignedWord (SRC1[63:32]);
TMP_DEST[47:32] ← SaturateSignedDwordToSignedWord (SRC1[95:64]);
TMP_DEST[63:48] ← SaturateSignedDwordToSignedWord (SRC1[127:96]);
TMP_DEST[79:64] ← SaturateSignedDwordToSignedWord (TMP_SRC2[31:0]);
TMP_DEST[95:80] ← SaturateSignedDwordToSignedWord (TMP_SRC2[63:32]);
TMP_DEST[111:96] ← SaturateSignedDwordToSignedWord (TMP_SRC2[95:64]);
TMP_DEST[127:112] ← SaturateSignedDwordToSignedWord (TMP_SRC2[127:96]);
IF VL >= 256
    TMP_DEST[143:128] ← SaturateSignedDwordToSignedWord (SRC1[159:128]);
    TMP_DEST[159:144] ← SaturateSignedDwordToSignedWord (SRC1[191:160]);
    TMP_DEST[175:160] ← SaturateSignedDwordToSignedWord (SRC1[223:192]);
    TMP_DEST[191:176] ← SaturateSignedDwordToSignedWord (SRC1[255:224]);
    TMP_DEST[207:192] ← SaturateSignedDwordToSignedWord (TMP_SRC2[159:128]);
    TMP_DEST[223:208] ← SaturateSignedDwordToSignedWord (TMP_SRC2[191:160]);
    TMP_DEST[239:224] ← SaturateSignedDwordToSignedWord (TMP_SRC2[223:192]);
    TMP_DEST[255:240] ← SaturateSignedDwordToSignedWord (TMP_SRC2[255:224]);
FI;
IF VL >= 512
    TMP_DEST[271:256] ← SaturateSignedDwordToSignedWord (SRC1[287:256]);
    TMP_DEST[287:272] ← SaturateSignedDwordToSignedWord (SRC1[319:288]);
    TMP_DEST[303:288] ← SaturateSignedDwordToSignedWord (SRC1[351:320]);
    TMP_DEST[319:304] ← SaturateSignedDwordToSignedWord (SRC1[383:352]);
    TMP_DEST[335:320] ← SaturateSignedDwordToSignedWord (TMP_SRC2[287:256]);
    TMP_DEST[351:336] ← SaturateSignedDwordToSignedWord (TMP_SRC2[319:288]);
    TMP_DEST[367:352] ← SaturateSignedDwordToSignedWord (TMP_SRC2[351:320]);
    TMP_DEST[383:368] ← SaturateSignedDwordToSignedWord (TMP_SRC2[383:352]);
    TMP_DEST[399:384] ← SaturateSignedDwordToSignedWord (SRC1[415:384]);
    TMP_DEST[415:400] ← SaturateSignedDwordToSignedWord (SRC1[447:416]);
    TMP_DEST[431:416] ← SaturateSignedDwordToSignedWord (SRC1[479:448]);
    TMP_DEST[447:432] ← SaturateSignedDwordToSignedWord (SRC1[511:480]);
    TMP_DEST[463:448] ← SaturateSignedDwordToSignedWord (TMP_SRC2[415:384]);
    TMP_DEST[479:464] ← SaturateSignedDwordToSignedWord (TMP_SRC2[447:416]);
    TMP_DEST[495:480] ← SaturateSignedDwordToSignedWord (TMP_SRC2[479:448]);
    TMP_DEST[511:496] ← SaturateSignedDwordToSignedWord (TMP_SRC2[511:480]);
FI;
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← TMP_DEST[i+15:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+15:i] ← 0
            FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0
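
The EVEX flow for VPACKSSDW has three stages: optional broadcast of SRC2, lane-wise pack with saturation into a temporary, then a masked write of word elements. The sketch below strings those stages together under simplifying assumptions: operands are lists of signed integers, `zeroing` stands in for {z}, and `broadcast` stands in for a memory source with EVEX.b = 1. The name `vpackssdw_evex` is hypothetical.

```python
def vpackssdw_evex(dest, src1, src2, k1, zeroing=False, broadcast=False):
    """Scalar model of EVEX-encoded VPACKSSDW.

    dest holds KL word lanes; src1 holds KL/2 doubleword lanes; src2 holds
    KL/2 doubleword lanes, or at least one when broadcast=True.
    """
    n = len(src1)                                   # KL/2 doublewords
    # Stage 1: embedded broadcast replicates SRC2[31:0] to every lane.
    tmp_src2 = [src2[0]] * n if broadcast else list(src2)
    # Stage 2: pack each 128-bit lane (4 doublewords per source) with
    # signed saturation to the range 8000H..7FFFH.
    tmp_dest = []
    for base in range(0, n, 4):
        for operand in (src1, tmp_src2):
            tmp_dest.extend(max(-0x8000, min(0x7FFF, v))
                            for v in operand[base:base + 4])
    # Stage 3: the writemask is applied per 16-bit destination element.
    result = list(dest)
    for j, value in enumerate(tmp_dest):
        if (k1 >> j) & 1:
            result[j] = value
        elif zeroing:
            result[j] = 0
    return result
```

Note that the mask has one bit per destination word (KL bits), which is twice the number of doubleword elements in each source.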

Intel C/C++ Compiler Intrinsic Equivalents

VPACKSSDW __m512i _mm512_packs_epi32(__m512i m1, __m512i m2);
VPACKSSDW __m512i _mm512_mask_packs_epi32(__m512i s, __mmask32 k, __m512i m1, __m512i m2);
VPACKSSDW __m512i _mm512_maskz_packs_epi32(__mmask32 k, __m512i m1, __m512i m2);
VPACKSSDW __m256i _mm256_mask_packs_epi32(__m256i s, __mmask16 k, __m256i m1, __m256i m2);
VPACKSSDW __m256i _mm256_maskz_packs_epi32(__mmask16 k, __m256i m1, __m256i m2);
VPACKSSDW __m128i _mm_mask_packs_epi32(__m128i s, __mmask8 k, __m128i m1, __m128i m2);
VPACKSSDW __m128i _mm_maskz_packs_epi32(__mmask8 k, __m128i m1, __m128i m2);
VPACKSSWB __m512i _mm512_packs_epi16(__m512i m1, __m512i m2);
VPACKSSWB __m512i _mm512_mask_packs_epi16(__m512i s, __mmask32 k, __m512i m1, __m512i m2);
VPACKSSWB __m512i _mm512_maskz_packs_epi16(__mmask32 k, __m512i m1, __m512i m2);
VPACKSSWB __m256i _mm256_mask_packs_epi16(__m256i s, __mmask16 k, __m256i m1, __m256i m2);
VPACKSSWB __m256i _mm256_maskz_packs_epi16(__mmask16 k, __m256i m1, __m256i m2);
VPACKSSWB __m128i _mm_mask_packs_epi16(__m128i s, __mmask8 k, __m128i m1, __m128i m2);
VPACKSSWB __m128i _mm_maskz_packs_epi16(__mmask8 k, __m128i m1, __m128i m2);
PACKSSWB __m128i _mm_packs_epi16(__m128i m1, __m128i m2)
PACKSSDW __m128i _mm_packs_epi32(__m128i m1, __m128i m2)
VPACKSSWB __m256i _mm256_packs_epi16(__m256i m1, __m256i m2)
VPACKSSDW __m256i _mm256_packs_epi32(__m256i m1, __m256i m2)

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded VPACKSSDW, see Exceptions Type E4NF. 

EVEX-encoded VPACKSSWB, see Exceptions Type E4NF.nb.

PACKUSDW - Pack with Unsigned Saturation


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 38 2B /r PACKUSDW xmm1, xmm2/m128 | RM | V/V | SSE4_1 | Convert 4 packed signed doubleword integers from xmm1 and 4 packed signed doubleword integers from xmm2/m128 into 8 packed unsigned word integers in xmm1 using unsigned saturation.
VEX.NDS.128.66.0F38 2B /r VPACKUSDW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Convert 4 packed signed doubleword integers from xmm2 and 4 packed signed doubleword integers from xmm3/m128 into 8 packed unsigned word integers in xmm1 using unsigned saturation.
VEX.NDS.256.66.0F38 2B /r VPACKUSDW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Convert 8 packed signed doubleword integers from ymm2 and 8 packed signed doubleword integers from ymm3/m256 into 16 packed unsigned word integers in ymm1 using unsigned saturation.
EVEX.NDS.128.66.0F38.W0 2B /r VPACKUSDW xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512BW | Convert packed signed doubleword integers from xmm2 and packed signed doubleword integers from xmm3/m128/m32bcst into packed unsigned word integers in xmm1 using unsigned saturation under writemask k1.
EVEX.NDS.256.66.0F38.W0 2B /r VPACKUSDW ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512BW | Convert packed signed doubleword integers from ymm2 and packed signed doubleword integers from ymm3/m256/m32bcst into packed unsigned word integers in ymm1 using unsigned saturation under writemask k1.
EVEX.NDS.512.66.0F38.W0 2B /r VPACKUSDW zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst | FV | V/V | AVX512BW | Convert packed signed doubleword integers from zmm2 and packed signed doubleword integers from zmm3/m512/m32bcst into packed unsigned word integers in zmm1 using unsigned saturation under writemask k1.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Converts packed signed doubleword integers in the first and second source operands into packed unsigned word
integers using unsigned saturation to handle overflow conditions. If the signed doubleword value is beyond the
range of an unsigned word (that is, greater than FFFFH or less than 0000H), the saturated unsigned word integer
value of FFFFH or 0000H, respectively, is stored in the destination.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand is a
ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a
32-bit memory location. The destination operand is a ZMM/YMM/XMM register, updated conditionally under the
writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register 
or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAX_VL-1:256) of the 
corresponding ZMM register destination are zeroed. 

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM 
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128) 
of the corresponding ZMM register destination are zeroed. 

128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM
register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the
upper bits (MAX_VL-1:128) of the corresponding destination register are unmodified.

Operation

PACKUSDW (Legacy SSE instruction)
    TMP[15:0] ← (DEST[31:0] < 0) ? 0 : DEST[15:0];
    DEST[15:0] ← (DEST[31:0] > FFFFH) ? FFFFH : TMP[15:0];
    TMP[31:16] ← (DEST[63:32] < 0) ? 0 : DEST[47:32];
    DEST[31:16] ← (DEST[63:32] > FFFFH) ? FFFFH : TMP[31:16];
    TMP[47:32] ← (DEST[95:64] < 0) ? 0 : DEST[79:64];
    DEST[47:32] ← (DEST[95:64] > FFFFH) ? FFFFH : TMP[47:32];
    TMP[63:48] ← (DEST[127:96] < 0) ? 0 : DEST[111:96];
    DEST[63:48] ← (DEST[127:96] > FFFFH) ? FFFFH : TMP[63:48];
    TMP[79:64] ← (SRC[31:0] < 0) ? 0 : SRC[15:0];
    DEST[79:64] ← (SRC[31:0] > FFFFH) ? FFFFH : TMP[79:64];
    TMP[95:80] ← (SRC[63:32] < 0) ? 0 : SRC[47:32];
    DEST[95:80] ← (SRC[63:32] > FFFFH) ? FFFFH : TMP[95:80];
    TMP[111:96] ← (SRC[95:64] < 0) ? 0 : SRC[79:64];
    DEST[111:96] ← (SRC[95:64] > FFFFH) ? FFFFH : TMP[111:96];
    TMP[127:112] ← (SRC[127:96] < 0) ? 0 : SRC[111:96];
    DEST[127:112] ← (SRC[127:96] > FFFFH) ? FFFFH : TMP[127:112];
    DEST[MAX_VL-1:128] (Unmodified)


PACKUSDW (VEX.128 encoded version) 

TMP[15:0] <- (SRC1[31:0] < 0) ? 0 : SRC1[15:0];
DEST[15:0] <- (SRC1[31:0] > FFFFH) ? FFFFH : TMP[15:0];
TMP[31:16] <- (SRC1[63:32] < 0) ? 0 : SRC1[47:32];
DEST[31:16] <- (SRC1[63:32] > FFFFH) ? FFFFH : TMP[31:16];
TMP[47:32] <- (SRC1[95:64] < 0) ? 0 : SRC1[79:64];
DEST[47:32] <- (SRC1[95:64] > FFFFH) ? FFFFH : TMP[47:32];
TMP[63:48] <- (SRC1[127:96] < 0) ? 0 : SRC1[111:96];
DEST[63:48] <- (SRC1[127:96] > FFFFH) ? FFFFH : TMP[63:48];
TMP[79:64] <- (SRC2[31:0] < 0) ? 0 : SRC2[15:0];
DEST[79:64] <- (SRC2[31:0] > FFFFH) ? FFFFH : TMP[79:64];
TMP[95:80] <- (SRC2[63:32] < 0) ? 0 : SRC2[47:32];
DEST[95:80] <- (SRC2[63:32] > FFFFH) ? FFFFH : TMP[95:80];
TMP[111:96] <- (SRC2[95:64] < 0) ? 0 : SRC2[79:64];
DEST[111:96] <- (SRC2[95:64] > FFFFH) ? FFFFH : TMP[111:96];
TMP[127:112] <- (SRC2[127:96] < 0) ? 0 : SRC2[111:96];
DEST[127:112] <- (SRC2[127:96] > FFFFH) ? FFFFH : TMP[127:112];
DEST[MAX_VL-1:128] <- 0;


VPACKUSDW (VEX.256 encoded version) 

TMP[15:0] <- (SRC1[31:0] < 0) ? 0 : SRC1[15:0];
DEST[15:0] <- (SRC1[31:0] > FFFFH) ? FFFFH : TMP[15:0];
TMP[31:16] <- (SRC1[63:32] < 0) ? 0 : SRC1[47:32];
DEST[31:16] <- (SRC1[63:32] > FFFFH) ? FFFFH : TMP[31:16];
TMP[47:32] <- (SRC1[95:64] < 0) ? 0 : SRC1[79:64];
DEST[47:32] <- (SRC1[95:64] > FFFFH) ? FFFFH : TMP[47:32];
TMP[63:48] <- (SRC1[127:96] < 0) ? 0 : SRC1[111:96];
DEST[63:48] <- (SRC1[127:96] > FFFFH) ? FFFFH : TMP[63:48];
TMP[79:64] <- (SRC2[31:0] < 0) ? 0 : SRC2[15:0];
DEST[79:64] <- (SRC2[31:0] > FFFFH) ? FFFFH : TMP[79:64];
TMP[95:80] <- (SRC2[63:32] < 0) ? 0 : SRC2[47:32];
DEST[95:80] <- (SRC2[63:32] > FFFFH) ? FFFFH : TMP[95:80];
TMP[111:96] <- (SRC2[95:64] < 0) ? 0 : SRC2[79:64];
DEST[111:96] <- (SRC2[95:64] > FFFFH) ? FFFFH : TMP[111:96];
TMP[127:112] <- (SRC2[127:96] < 0) ? 0 : SRC2[111:96];
DEST[127:112] <- (SRC2[127:96] > FFFFH) ? FFFFH : TMP[127:112];
TMP[143:128] <- (SRC1[159:128] < 0) ? 0 : SRC1[143:128];
DEST[143:128] <- (SRC1[159:128] > FFFFH) ? FFFFH : TMP[143:128];
TMP[159:144] <- (SRC1[191:160] < 0) ? 0 : SRC1[175:160];
DEST[159:144] <- (SRC1[191:160] > FFFFH) ? FFFFH : TMP[159:144];
TMP[175:160] <- (SRC1[223:192] < 0) ? 0 : SRC1[207:192];
DEST[175:160] <- (SRC1[223:192] > FFFFH) ? FFFFH : TMP[175:160];
TMP[191:176] <- (SRC1[255:224] < 0) ? 0 : SRC1[239:224];
DEST[191:176] <- (SRC1[255:224] > FFFFH) ? FFFFH : TMP[191:176];
TMP[207:192] <- (SRC2[159:128] < 0) ? 0 : SRC2[143:128];
DEST[207:192] <- (SRC2[159:128] > FFFFH) ? FFFFH : TMP[207:192];
TMP[223:208] <- (SRC2[191:160] < 0) ? 0 : SRC2[175:160];
DEST[223:208] <- (SRC2[191:160] > FFFFH) ? FFFFH : TMP[223:208];
TMP[239:224] <- (SRC2[223:192] < 0) ? 0 : SRC2[207:192];
DEST[239:224] <- (SRC2[223:192] > FFFFH) ? FFFFH : TMP[239:224];
TMP[255:240] <- (SRC2[255:224] < 0) ? 0 : SRC2[239:224];
DEST[255:240] <- (SRC2[255:224] > FFFFH) ? FFFFH : TMP[255:240];
DEST[MAX_VL-1:256] <- 0;
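
The 256-bit operation above packs each 128-bit lane independently: within a lane, four words come from the first source and four from the second. A minimal scalar sketch of that element ordering follows, assuming the operands are held as plain arrays; the names sat_u16 and packusdw_256_model are hypothetical.

```c
#include <stdint.h>

/* Clamp a signed doubleword to 0000H..FFFFH (assumed helper name). */
static uint16_t sat_u16(int32_t x)
{
    return (x < 0) ? 0 : (x > 0xFFFF) ? 0xFFFF : (uint16_t)x;
}

/* Scalar model of the VEX.256 ordering: src1 and src2 each hold
   8 doublewords, dest receives 16 words. Lane 0 covers bits 127:0,
   lane 1 covers bits 255:128. dest must not alias the sources. */
static void packusdw_256_model(uint16_t dest[16],
                               const int32_t src1[8],
                               const int32_t src2[8])
{
    for (int lane = 0; lane < 2; lane++) {
        for (int j = 0; j < 4; j++) {
            /* low half of the lane: elements of the first source */
            dest[lane * 8 + j] = sat_u16(src1[lane * 4 + j]);
            /* high half of the lane: elements of the second source */
            dest[lane * 8 + 4 + j] = sat_u16(src2[lane * 4 + j]);
        }
    }
}
```

Note that the result is not the simple concatenation of all first-source elements followed by all second-source elements; the two sources alternate every 64 bits of the destination.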


VPACKUSDW (EVEX encoded versions) 

(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j <- 0 TO ((KL/2)-1)
    i <- j * 32
    IF (EVEX.b == 1) AND (SRC2 *is memory*)
        THEN
            TMP_SRC2[i+31:i] <- SRC2[31:0]
        ELSE
            TMP_SRC2[i+31:i] <- SRC2[i+31:i]
    FI;
ENDFOR;

TMP[15:0] <- (SRC1[31:0] < 0) ? 0 : SRC1[15:0];
TMP_DEST[15:0] <- (SRC1[31:0] > FFFFH) ? FFFFH : TMP[15:0];
TMP[31:16] <- (SRC1[63:32] < 0) ? 0 : SRC1[47:32];
TMP_DEST[31:16] <- (SRC1[63:32] > FFFFH) ? FFFFH : TMP[31:16];
TMP[47:32] <- (SRC1[95:64] < 0) ? 0 : SRC1[79:64];
TMP_DEST[47:32] <- (SRC1[95:64] > FFFFH) ? FFFFH : TMP[47:32];
TMP[63:48] <- (SRC1[127:96] < 0) ? 0 : SRC1[111:96];
TMP_DEST[63:48] <- (SRC1[127:96] > FFFFH) ? FFFFH : TMP[63:48];
TMP[79:64] <- (TMP_SRC2[31:0] < 0) ? 0 : TMP_SRC2[15:0];
TMP_DEST[79:64] <- (TMP_SRC2[31:0] > FFFFH) ? FFFFH : TMP[79:64];
TMP[95:80] <- (TMP_SRC2[63:32] < 0) ? 0 : TMP_SRC2[47:32];
TMP_DEST[95:80] <- (TMP_SRC2[63:32] > FFFFH) ? FFFFH : TMP[95:80];
TMP[111:96] <- (TMP_SRC2[95:64] < 0) ? 0 : TMP_SRC2[79:64];
TMP_DEST[111:96] <- (TMP_SRC2[95:64] > FFFFH) ? FFFFH : TMP[111:96];
TMP[127:112] <- (TMP_SRC2[127:96] < 0) ? 0 : TMP_SRC2[111:96];
TMP_DEST[127:112] <- (TMP_SRC2[127:96] > FFFFH) ? FFFFH : TMP[127:112];
IF VL >= 256
    TMP[143:128] <- (SRC1[159:128] < 0) ? 0 : SRC1[143:128];
    TMP_DEST[143:128] <- (SRC1[159:128] > FFFFH) ? FFFFH : TMP[143:128];
    TMP[159:144] <- (SRC1[191:160] < 0) ? 0 : SRC1[175:160];
    TMP_DEST[159:144] <- (SRC1[191:160] > FFFFH) ? FFFFH : TMP[159:144];
    TMP[175:160] <- (SRC1[223:192] < 0) ? 0 : SRC1[207:192];
    TMP_DEST[175:160] <- (SRC1[223:192] > FFFFH) ? FFFFH : TMP[175:160];
    TMP[191:176] <- (SRC1[255:224] < 0) ? 0 : SRC1[239:224];
    TMP_DEST[191:176] <- (SRC1[255:224] > FFFFH) ? FFFFH : TMP[191:176];
    TMP[207:192] <- (TMP_SRC2[159:128] < 0) ? 0 : TMP_SRC2[143:128];
    TMP_DEST[207:192] <- (TMP_SRC2[159:128] > FFFFH) ? FFFFH : TMP[207:192];
    TMP[223:208] <- (TMP_SRC2[191:160] < 0) ? 0 : TMP_SRC2[175:160];
    TMP_DEST[223:208] <- (TMP_SRC2[191:160] > FFFFH) ? FFFFH : TMP[223:208];
    TMP[239:224] <- (TMP_SRC2[223:192] < 0) ? 0 : TMP_SRC2[207:192];
    TMP_DEST[239:224] <- (TMP_SRC2[223:192] > FFFFH) ? FFFFH : TMP[239:224];
    TMP[255:240] <- (TMP_SRC2[255:224] < 0) ? 0 : TMP_SRC2[239:224];
    TMP_DEST[255:240] <- (TMP_SRC2[255:224] > FFFFH) ? FFFFH : TMP[255:240];
FI;

IF VL >= 512
    TMP[271:256] <- (SRC1[287:256] < 0) ? 0 : SRC1[271:256];
    TMP_DEST[271:256] <- (SRC1[287:256] > FFFFH) ? FFFFH : TMP[271:256];
    TMP[287:272] <- (SRC1[319:288] < 0) ? 0 : SRC1[303:288];
    TMP_DEST[287:272] <- (SRC1[319:288] > FFFFH) ? FFFFH : TMP[287:272];
    TMP[303:288] <- (SRC1[351:320] < 0) ? 0 : SRC1[335:320];
    TMP_DEST[303:288] <- (SRC1[351:320] > FFFFH) ? FFFFH : TMP[303:288];
    TMP[319:304] <- (SRC1[383:352] < 0) ? 0 : SRC1[367:352];
    TMP_DEST[319:304] <- (SRC1[383:352] > FFFFH) ? FFFFH : TMP[319:304];
    TMP[335:320] <- (TMP_SRC2[287:256] < 0) ? 0 : TMP_SRC2[271:256];
    TMP_DEST[335:320] <- (TMP_SRC2[287:256] > FFFFH) ? FFFFH : TMP[335:320];
    TMP[351:336] <- (TMP_SRC2[319:288] < 0) ? 0 : TMP_SRC2[303:288];
    TMP_DEST[351:336] <- (TMP_SRC2[319:288] > FFFFH) ? FFFFH : TMP[351:336];
    TMP[367:352] <- (TMP_SRC2[351:320] < 0) ? 0 : TMP_SRC2[335:320];
    TMP_DEST[367:352] <- (TMP_SRC2[351:320] > FFFFH) ? FFFFH : TMP[367:352];
    TMP[383:368] <- (TMP_SRC2[383:352] < 0) ? 0 : TMP_SRC2[367:352];
    TMP_DEST[383:368] <- (TMP_SRC2[383:352] > FFFFH) ? FFFFH : TMP[383:368];
    TMP[399:384] <- (SRC1[415:384] < 0) ? 0 : SRC1[399:384];
    TMP_DEST[399:384] <- (SRC1[415:384] > FFFFH) ? FFFFH : TMP[399:384];
    TMP[415:400] <- (SRC1[447:416] < 0) ? 0 : SRC1[431:416];
    TMP_DEST[415:400] <- (SRC1[447:416] > FFFFH) ? FFFFH : TMP[415:400];
    TMP[431:416] <- (SRC1[479:448] < 0) ? 0 : SRC1[463:448];
    TMP_DEST[431:416] <- (SRC1[479:448] > FFFFH) ? FFFFH : TMP[431:416];
    TMP[447:432] <- (SRC1[511:480] < 0) ? 0 : SRC1[495:480];
    TMP_DEST[447:432] <- (SRC1[511:480] > FFFFH) ? FFFFH : TMP[447:432];
    TMP[463:448] <- (TMP_SRC2[415:384] < 0) ? 0 : TMP_SRC2[399:384];
    TMP_DEST[463:448] <- (TMP_SRC2[415:384] > FFFFH) ? FFFFH : TMP[463:448];
    TMP[479:464] <- (TMP_SRC2[447:416] < 0) ? 0 : TMP_SRC2[431:416];
    TMP_DEST[479:464] <- (TMP_SRC2[447:416] > FFFFH) ? FFFFH : TMP[479:464];
    TMP[495:480] <- (TMP_SRC2[479:448] < 0) ? 0 : TMP_SRC2[463:448];
    TMP_DEST[495:480] <- (TMP_SRC2[479:448] > FFFFH) ? FFFFH : TMP[495:480];
    TMP[511:496] <- (TMP_SRC2[511:480] < 0) ? 0 : TMP_SRC2[495:480];
    TMP_DEST[511:496] <- (TMP_SRC2[511:480] > FFFFH) ? FFFFH : TMP[511:496];
FI;

FOR j <- 0 TO KL-1
    i <- j * 16
    IF k1[j] OR *no writemask*
        THEN
            DEST[i+15:i] <- TMP_DEST[i+15:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+15:i] <- 0
            FI
    FI;
ENDFOR;

DEST[MAX_VL-1:VL] <- 0
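
The final writemask loop above can be modeled in scalar C as follows. This is a sketch under the assumption that the packed result has already been computed into a temporary array; the function name apply_writemask_u16 and its parameters are hypothetical.

```c
#include <stdint.h>

/* Scalar model of the per-word writemask step. tmp holds the packed
   result, k holds one mask bit per 16-bit element (bit j for element j),
   kl is the element count (at most 32), and zeroing selects
   zeroing-masking (nonzero) or merging-masking (zero). */
static void apply_writemask_u16(uint16_t *dest, const uint16_t *tmp,
                                uint32_t k, int kl, int zeroing)
{
    for (int j = 0; j < kl; j++) {
        if ((k >> j) & 1u)
            dest[j] = tmp[j];   /* mask bit set: take the new element */
        else if (zeroing)
            dest[j] = 0;        /* zeroing-masking: clear the element */
        /* merging-masking: dest[j] keeps its previous value */
    }
}
```

With a mask of 0101B over four elements, merging-masking would replace elements 0 and 2 and preserve elements 1 and 3, whereas zeroing-masking would clear elements 1 and 3.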

Intel C/C++ Compiler Intrinsic Equivalents 

VPACKUSDW __m512i _mm512_packus_epi32(__m512i m1, __m512i m2);
VPACKUSDW __m512i _mm512_mask_packus_epi32(__m512i s, __mmask32 k, __m512i m1, __m512i m2);
VPACKUSDW __m512i _mm512_maskz_packus_epi32(__mmask32 k, __m512i m1, __m512i m2);
VPACKUSDW __m256i _mm256_mask_packus_epi32(__m256i s, __mmask16 k, __m256i m1, __m256i m2);
VPACKUSDW __m256i _mm256_maskz_packus_epi32(__mmask16 k, __m256i m1, __m256i m2);
VPACKUSDW __m128i _mm_mask_packus_epi32(__m128i s, __mmask8 k, __m128i m1, __m128i m2);
VPACKUSDW __m128i _mm_maskz_packus_epi32(__mmask8 k, __m128i m1, __m128i m2);
PACKUSDW __m128i _mm_packus_epi32(__m128i m1, __m128i m2);
VPACKUSDW __m256i _mm256_packus_epi32(__m256i m1, __m256i m2);

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded instruction, see Exceptions Type E4NF.

PACKUSWB-Pack with Unsigned Saturation

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F 67 /r¹ | PACKUSWB mm, mm/m64 | RM | V/V | MMX | Converts 4 signed word integers from mm and 4 signed word integers from mm/m64 into 8 unsigned byte integers in mm using unsigned saturation.
66 0F 67 /r | PACKUSWB xmm1, xmm2/m128 | RM | V/V | SSE2 | Converts 8 signed word integers from xmm1 and 8 signed word integers from xmm2/m128 into 16 unsigned byte integers in xmm1 using unsigned saturation.
VEX.NDS.128.66.0F.WIG 67 /r | VPACKUSWB xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Converts 8 signed word integers from xmm2 and 8 signed word integers from xmm3/m128 into 16 unsigned byte integers in xmm1 using unsigned saturation.
VEX.NDS.256.66.0F.WIG 67 /r | VPACKUSWB ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Converts 16 signed word integers from ymm2 and 16 signed word integers from ymm3/m256 into 32 unsigned byte integers in ymm1 using unsigned saturation.
EVEX.NDS.128.66.0F.WIG 67 /r | VPACKUSWB xmm1{k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Converts signed word integers from xmm2 and signed word integers from xmm3/m128 into unsigned byte integers in xmm1 using unsigned saturation under writemask k1.
EVEX.NDS.256.66.0F.WIG 67 /r | VPACKUSWB ymm1{k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Converts signed word integers from ymm2 and signed word integers from ymm3/m256 into unsigned byte integers in ymm1 using unsigned saturation under writemask k1.
EVEX.NDS.512.66.0F.WIG 67 /r | VPACKUSWB zmm1{k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Converts signed word integers from zmm2 and signed word integers from zmm3/m512 into unsigned byte integers in zmm1 using unsigned saturation under writemask k1.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers" in
the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Converts 4, 8, 16 or 32 signed word integers from the destination operand (first operand) and 4, 8, 16 or 32 signed
word integers from the source operand (second operand) into 8, 16, 32 or 64 unsigned byte integers and stores the
result in the destination operand. (See Figure 4-6 for an example of the packing operation.) If a signed word
integer value is beyond the range of an unsigned byte integer (that is, greater than FFH or less than 00H), the
saturated unsigned byte integer value of FFH or 00H, respectively, is stored in the destination.
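
As an illustration of the description above, the following scalar C sketch models the 64-bit operand form: four words from the destination followed by four words from the source, each clamped to the unsigned byte range. The function names are hypothetical and the operands are assumed to be held as plain arrays.

```c
#include <stdint.h>

/* Clamp a signed word to the unsigned byte range 00H..FFH
   (assumed helper name). */
static uint8_t saturate_word_to_ubyte(int16_t x)
{
    if (x < 0)
        return 0x00;
    if (x > 0xFF)
        return 0xFF;
    return (uint8_t)x;
}

/* Scalar model of the 64-bit form: out[0..3] come from the destination
   operand's words, out[4..7] from the source operand's words. */
static void packuswb_64_model(uint8_t out[8],
                              const int16_t dest[4],
                              const int16_t src[4])
{
    for (int j = 0; j < 4; j++) {
        out[j] = saturate_word_to_ubyte(dest[j]);
        out[4 + j] = saturate_word_to_ubyte(src[j]);
    }
}
```

Negative words (including 8000H, that is, -32768) map to 00H, and any word above FFH maps to FFH.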

EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM 
register or a 512-bit memory location. The destination operand is a ZMM register.

VEX.256 and EVEX.256 encoded versions: The first source operand is a YMM register. The second source operand
is a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper bits
(MAX_VL-1:256) of the corresponding ZMM register destination are zeroed.

VEX.128 and EVEX.128 encoded versions: The first source operand is an XMM register. The second source operand 
is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits 
(MAX_VL-1:128) of the corresponding register destination are zeroed. 

128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM 
register or an 128-bit memory location. The destination is not distinct from the first source XMM register and the 
upper bits (MAX_VL-1:128) of the corresponding register destination are unmodified. 

Operation 

PACKUSWB (with 64-bit operands) 

DEST[7:0] <- SaturateSignedWordToUnsignedByte (DEST[15:0]);
DEST[15:8] <- SaturateSignedWordToUnsignedByte (DEST[31:16]);
DEST[23:16] <- SaturateSignedWordToUnsignedByte (DEST[47:32]);
DEST[31:24] <- SaturateSignedWordToUnsignedByte (DEST[63:48]);
DEST[39:32] <- SaturateSignedWordToUnsignedByte (SRC[15:0]);
DEST[47:40] <- SaturateSignedWordToUnsignedByte (SRC[31:16]);
DEST[55:48] <- SaturateSignedWordToUnsignedByte (SRC[47:32]);
DEST[63:56] <- SaturateSignedWordToUnsignedByte (SRC[63:48]);

PACKUSWB (Legacy SSE instruction) 

DEST[7:0] <- SaturateSignedWordToUnsignedByte (DEST[15:0]);
DEST[15:8] <- SaturateSignedWordToUnsignedByte (DEST[31:16]);
DEST[23:16] <- SaturateSignedWordToUnsignedByte (DEST[47:32]);
DEST[31:24] <- SaturateSignedWordToUnsignedByte (DEST[63:48]);
DEST[39:32] <- SaturateSignedWordToUnsignedByte (DEST[79:64]);
DEST[47:40] <- SaturateSignedWordToUnsignedByte (DEST[95:80]);
DEST[55:48] <- SaturateSignedWordToUnsignedByte (DEST[111:96]);
DEST[63:56] <- SaturateSignedWordToUnsignedByte (DEST[127:112]);
DEST[71:64] <- SaturateSignedWordToUnsignedByte (SRC[15:0]);
DEST[79:72] <- SaturateSignedWordToUnsignedByte (SRC[31:16]);
DEST[87:80] <- SaturateSignedWordToUnsignedByte (SRC[47:32]);
DEST[95:88] <- SaturateSignedWordToUnsignedByte (SRC[63:48]);
DEST[103:96] <- SaturateSignedWordToUnsignedByte (SRC[79:64]);
DEST[111:104] <- SaturateSignedWordToUnsignedByte (SRC[95:80]);
DEST[119:112] <- SaturateSignedWordToUnsignedByte (SRC[111:96]);
DEST[127:120] <- SaturateSignedWordToUnsignedByte (SRC[127:112]);

PACKUSWB (VEX.128 encoded version) 

DEST[7:0] <- SaturateSignedWordToUnsignedByte (SRC1[15:0]);
DEST[15:8] <- SaturateSignedWordToUnsignedByte (SRC1[31:16]);
DEST[23:16] <- SaturateSignedWordToUnsignedByte (SRC1[47:32]);
DEST[31:24] <- SaturateSignedWordToUnsignedByte (SRC1[63:48]);
DEST[39:32] <- SaturateSignedWordToUnsignedByte (SRC1[79:64]);
DEST[47:40] <- SaturateSignedWordToUnsignedByte (SRC1[95:80]);
DEST[55:48] <- SaturateSignedWordToUnsignedByte (SRC1[111:96]);
DEST[63:56] <- SaturateSignedWordToUnsignedByte (SRC1[127:112]);
DEST[71:64] <- SaturateSignedWordToUnsignedByte (SRC2[15:0]);
DEST[79:72] <- SaturateSignedWordToUnsignedByte (SRC2[31:16]);
DEST[87:80] <- SaturateSignedWordToUnsignedByte (SRC2[47:32]);
DEST[95:88] <- SaturateSignedWordToUnsignedByte (SRC2[63:48]);
DEST[103:96] <- SaturateSignedWordToUnsignedByte (SRC2[79:64]);
DEST[111:104] <- SaturateSignedWordToUnsignedByte (SRC2[95:80]);
DEST[119:112] <- SaturateSignedWordToUnsignedByte (SRC2[111:96]);
DEST[127:120] <- SaturateSignedWordToUnsignedByte (SRC2[127:112]);
DEST[VLMAX-1:128] <- 0;

VPACKUSWB (VEX.256 encoded version) 

DEST[7:0] <- SaturateSignedWordToUnsignedByte (SRC1[15:0]);
DEST[15:8] <- SaturateSignedWordToUnsignedByte (SRC1[31:16]);
DEST[23:16] <- SaturateSignedWordToUnsignedByte (SRC1[47:32]);
DEST[31:24] <- SaturateSignedWordToUnsignedByte (SRC1[63:48]);
DEST[39:32] <- SaturateSignedWordToUnsignedByte (SRC1[79:64]);
DEST[47:40] <- SaturateSignedWordToUnsignedByte (SRC1[95:80]);
DEST[55:48] <- SaturateSignedWordToUnsignedByte (SRC1[111:96]);
DEST[63:56] <- SaturateSignedWordToUnsignedByte (SRC1[127:112]);
DEST[71:64] <- SaturateSignedWordToUnsignedByte (SRC2[15:0]);
DEST[79:72] <- SaturateSignedWordToUnsignedByte (SRC2[31:16]);
DEST[87:80] <- SaturateSignedWordToUnsignedByte (SRC2[47:32]);
DEST[95:88] <- SaturateSignedWordToUnsignedByte (SRC2[63:48]);
DEST[103:96] <- SaturateSignedWordToUnsignedByte (SRC2[79:64]);
DEST[111:104] <- SaturateSignedWordToUnsignedByte (SRC2[95:80]);
DEST[119:112] <- SaturateSignedWordToUnsignedByte (SRC2[111:96]);
DEST[127:120] <- SaturateSignedWordToUnsignedByte (SRC2[127:112]);
DEST[135:128] <- SaturateSignedWordToUnsignedByte (SRC1[143:128]);
DEST[143:136] <- SaturateSignedWordToUnsignedByte (SRC1[159:144]);
DEST[151:144] <- SaturateSignedWordToUnsignedByte (SRC1[175:160]);
DEST[159:152] <- SaturateSignedWordToUnsignedByte (SRC1[191:176]);
DEST[167:160] <- SaturateSignedWordToUnsignedByte (SRC1[207:192]);
DEST[175:168] <- SaturateSignedWordToUnsignedByte (SRC1[223:208]);
DEST[183:176] <- SaturateSignedWordToUnsignedByte (SRC1[239:224]);
DEST[191:184] <- SaturateSignedWordToUnsignedByte (SRC1[255:240]);
DEST[199:192] <- SaturateSignedWordToUnsignedByte (SRC2[143:128]);
DEST[207:200] <- SaturateSignedWordToUnsignedByte (SRC2[159:144]);
DEST[215:208] <- SaturateSignedWordToUnsignedByte (SRC2[175:160]);
DEST[223:216] <- SaturateSignedWordToUnsignedByte (SRC2[191:176]);
DEST[231:224] <- SaturateSignedWordToUnsignedByte (SRC2[207:192]);
DEST[239:232] <- SaturateSignedWordToUnsignedByte (SRC2[223:208]);
DEST[247:240] <- SaturateSignedWordToUnsignedByte (SRC2[239:224]);
DEST[255:248] <- SaturateSignedWordToUnsignedByte (SRC2[255:240]);

VPACKUSWB (EVEX encoded versions) 

(KL, VL) = (16, 128), (32, 256), (64, 512)

TMP_DEST[7:0] <- SaturateSignedWordToUnsignedByte (SRC1[15:0]);
TMP_DEST[15:8] <- SaturateSignedWordToUnsignedByte (SRC1[31:16]);
TMP_DEST[23:16] <- SaturateSignedWordToUnsignedByte (SRC1[47:32]);
TMP_DEST[31:24] <- SaturateSignedWordToUnsignedByte (SRC1[63:48]);
TMP_DEST[39:32] <- SaturateSignedWordToUnsignedByte (SRC1[79:64]);
TMP_DEST[47:40] <- SaturateSignedWordToUnsignedByte (SRC1[95:80]);
TMP_DEST[55:48] <- SaturateSignedWordToUnsignedByte (SRC1[111:96]);
TMP_DEST[63:56] <- SaturateSignedWordToUnsignedByte (SRC1[127:112]);
TMP_DEST[71:64] <- SaturateSignedWordToUnsignedByte (SRC2[15:0]);
TMP_DEST[79:72] <- SaturateSignedWordToUnsignedByte (SRC2[31:16]);
TMP_DEST[87:80] <- SaturateSignedWordToUnsignedByte (SRC2[47:32]);
TMP_DEST[95:88] <- SaturateSignedWordToUnsignedByte (SRC2[63:48]);
TMP_DEST[103:96] <- SaturateSignedWordToUnsignedByte (SRC2[79:64]);
TMP_DEST[111:104] <- SaturateSignedWordToUnsignedByte (SRC2[95:80]);
TMP_DEST[119:112] <- SaturateSignedWordToUnsignedByte (SRC2[111:96]);
TMP_DEST[127:120] <- SaturateSignedWordToUnsignedByte (SRC2[127:112]);
IF VL >= 256
    TMP_DEST[135:128] <- SaturateSignedWordToUnsignedByte (SRC1[143:128]);
    TMP_DEST[143:136] <- SaturateSignedWordToUnsignedByte (SRC1[159:144]);
    TMP_DEST[151:144] <- SaturateSignedWordToUnsignedByte (SRC1[175:160]);
    TMP_DEST[159:152] <- SaturateSignedWordToUnsignedByte (SRC1[191:176]);
    TMP_DEST[167:160] <- SaturateSignedWordToUnsignedByte (SRC1[207:192]);
    TMP_DEST[175:168] <- SaturateSignedWordToUnsignedByte (SRC1[223:208]);
    TMP_DEST[183:176] <- SaturateSignedWordToUnsignedByte (SRC1[239:224]);
    TMP_DEST[191:184] <- SaturateSignedWordToUnsignedByte (SRC1[255:240]);
    TMP_DEST[199:192] <- SaturateSignedWordToUnsignedByte (SRC2[143:128]);
    TMP_DEST[207:200] <- SaturateSignedWordToUnsignedByte (SRC2[159:144]);
    TMP_DEST[215:208] <- SaturateSignedWordToUnsignedByte (SRC2[175:160]);
    TMP_DEST[223:216] <- SaturateSignedWordToUnsignedByte (SRC2[191:176]);
    TMP_DEST[231:224] <- SaturateSignedWordToUnsignedByte (SRC2[207:192]);
    TMP_DEST[239:232] <- SaturateSignedWordToUnsignedByte (SRC2[223:208]);
    TMP_DEST[247:240] <- SaturateSignedWordToUnsignedByte (SRC2[239:224]);
    TMP_DEST[255:248] <- SaturateSignedWordToUnsignedByte (SRC2[255:240]);
FI;

IF VL >= 512
    TMP_DEST[263:256] <- SaturateSignedWordToUnsignedByte (SRC1[271:256]);
    TMP_DEST[271:264] <- SaturateSignedWordToUnsignedByte (SRC1[287:272]);
    TMP_DEST[279:272] <- SaturateSignedWordToUnsignedByte (SRC1[303:288]);
    TMP_DEST[287:280] <- SaturateSignedWordToUnsignedByte (SRC1[319:304]);
    TMP_DEST[295:288] <- SaturateSignedWordToUnsignedByte (SRC1[335:320]);
    TMP_DEST[303:296] <- SaturateSignedWordToUnsignedByte (SRC1[351:336]);
    TMP_DEST[311:304] <- SaturateSignedWordToUnsignedByte (SRC1[367:352]);
    TMP_DEST[319:312] <- SaturateSignedWordToUnsignedByte (SRC1[383:368]);
    TMP_DEST[327:320] <- SaturateSignedWordToUnsignedByte (SRC2[271:256]);
    TMP_DEST[335:328] <- SaturateSignedWordToUnsignedByte (SRC2[287:272]);
    TMP_DEST[343:336] <- SaturateSignedWordToUnsignedByte (SRC2[303:288]);
    TMP_DEST[351:344] <- SaturateSignedWordToUnsignedByte (SRC2[319:304]);
    TMP_DEST[359:352] <- SaturateSignedWordToUnsignedByte (SRC2[335:320]);
    TMP_DEST[367:360] <- SaturateSignedWordToUnsignedByte (SRC2[351:336]);
    TMP_DEST[375:368] <- SaturateSignedWordToUnsignedByte (SRC2[367:352]);
    TMP_DEST[383:376] <- SaturateSignedWordToUnsignedByte (SRC2[383:368]);
    TMP_DEST[391:384] <- SaturateSignedWordToUnsignedByte (SRC1[399:384]);
    TMP_DEST[399:392] <- SaturateSignedWordToUnsignedByte (SRC1[415:400]);
    TMP_DEST[407:400] <- SaturateSignedWordToUnsignedByte (SRC1[431:416]);
    TMP_DEST[415:408] <- SaturateSignedWordToUnsignedByte (SRC1[447:432]);
    TMP_DEST[423:416] <- SaturateSignedWordToUnsignedByte (SRC1[463:448]);
    TMP_DEST[431:424] <- SaturateSignedWordToUnsignedByte (SRC1[479:464]);
    TMP_DEST[439:432] <- SaturateSignedWordToUnsignedByte (SRC1[495:480]);
    TMP_DEST[447:440] <- SaturateSignedWordToUnsignedByte (SRC1[511:496]);
    TMP_DEST[455:448] <- SaturateSignedWordToUnsignedByte (SRC2[399:384]);
    TMP_DEST[463:456] <- SaturateSignedWordToUnsignedByte (SRC2[415:400]);
    TMP_DEST[471:464] <- SaturateSignedWordToUnsignedByte (SRC2[431:416]);
    TMP_DEST[479:472] <- SaturateSignedWordToUnsignedByte (SRC2[447:432]);
    TMP_DEST[487:480] <- SaturateSignedWordToUnsignedByte (SRC2[463:448]);
    TMP_DEST[495:488] <- SaturateSignedWordToUnsignedByte (SRC2[479:464]);
    TMP_DEST[503:496] <- SaturateSignedWordToUnsignedByte (SRC2[495:480]);
    TMP_DEST[511:504] <- SaturateSignedWordToUnsignedByte (SRC2[511:496]);
FI;

FOR j <- 0 TO KL-1
    i <- j * 8
    IF k1[j] OR *no writemask*
        THEN
            DEST[i+7:i] <- TMP_DEST[i+7:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+7:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+7:i] <- 0
            FI
    FI;
ENDFOR;

DEST[MAX_VL-1:VL] <- 0

Intel C/C++ Compiler Intrinsic Equivalents 

VPACKUSWB __m512i _mm512_packus_epi16(__m512i m1, __m512i m2);
VPACKUSWB __m512i _mm512_mask_packus_epi16(__m512i s, __mmask64 k, __m512i m1, __m512i m2);
VPACKUSWB __m512i _mm512_maskz_packus_epi16(__mmask64 k, __m512i m1, __m512i m2);
VPACKUSWB __m256i _mm256_mask_packus_epi16(__m256i s, __mmask32 k, __m256i m1, __m256i m2);
VPACKUSWB __m256i _mm256_maskz_packus_epi16(__mmask32 k, __m256i m1, __m256i m2);
VPACKUSWB __m128i _mm_mask_packus_epi16(__m128i s, __mmask16 k, __m128i m1, __m128i m2);
VPACKUSWB __m128i _mm_maskz_packus_epi16(__mmask16 k, __m128i m1, __m128i m2);
PACKUSWB: __m64 _mm_packs_pu16(__m64 m1, __m64 m2)
(V)PACKUSWB: __m128i _mm_packus_epi16(__m128i m1, __m128i m2)
VPACKUSWB: __m256i _mm256_packus_epi16(__m256i m1, __m256i m2);

Flags Affected 

None 

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded instruction, see Exceptions Type E4NF.nb.

PADDB/PADDW/PADDD/PADDQ-Add Packed Integers

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F FC /r¹ | PADDB mm, mm/m64 | RM | V/V | MMX | Add packed byte integers from mm/m64 and mm.
0F FD /r¹ | PADDW mm, mm/m64 | RM | V/V | MMX | Add packed word integers from mm/m64 and mm.
66 0F FC /r | PADDB xmm1, xmm2/m128 | RM | V/V | SSE2 | Add packed byte integers from xmm2/m128 and xmm1.
66 0F FD /r | PADDW xmm1, xmm2/m128 | RM | V/V | SSE2 | Add packed word integers from xmm2/m128 and xmm1.
66 0F FE /r | PADDD xmm1, xmm2/m128 | RM | V/V | SSE2 | Add packed doubleword integers from xmm2/m128 and xmm1.
66 0F D4 /r | PADDQ xmm1, xmm2/m128 | RM | V/V | SSE2 | Add packed quadword integers from xmm2/m128 and xmm1.
VEX.NDS.128.66.0F.WIG FC /r | VPADDB xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Add packed byte integers from xmm2, and xmm3/m128 and store in xmm1.
VEX.NDS.128.66.0F.WIG FD /r | VPADDW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Add packed word integers from xmm2, xmm3/m128 and store in xmm1.
VEX.NDS.128.66.0F.WIG FE /r | VPADDD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Add packed doubleword integers from xmm2, xmm3/m128 and store in xmm1.
VEX.NDS.128.66.0F.WIG D4 /r | VPADDQ xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Add packed quadword integers from xmm2, xmm3/m128 and store in xmm1.
VEX.NDS.256.66.0F.WIG FC /r | VPADDB ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Add packed byte integers from ymm2, and ymm3/m256 and store in ymm1.
VEX.NDS.256.66.0F.WIG FD /r | VPADDW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Add packed word integers from ymm2, ymm3/m256 and store in ymm1.
VEX.NDS.256.66.0F.WIG FE /r | VPADDD ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Add packed doubleword integers from ymm2, ymm3/m256 and store in ymm1.
VEX.NDS.256.66.0F.WIG D4 /r | VPADDQ ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Add packed quadword integers from ymm2, ymm3/m256 and store in ymm1.
EVEX.NDS.128.66.0F.WIG FC /r | VPADDB xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Add packed byte integers from xmm2, and xmm3/m128 and store in xmm1 using writemask k1.
EVEX.NDS.128.66.0F.WIG FD /r | VPADDW xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Add packed word integers from xmm2, and xmm3/m128 and store in xmm1 using writemask k1.
EVEX.NDS.128.66.0F.W0 FE /r | VPADDD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Add packed doubleword integers from xmm2, and xmm3/m128/m32bcst and store in xmm1 using writemask k1.
EVEX.NDS.128.66.0F.W1 D4 /r | VPADDQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Add packed quadword integers from xmm2, and xmm3/m128/m64bcst and store in xmm1 using writemask k1.
EVEX.NDS.256.66.0F.WIG FC /r | VPADDB ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Add packed byte integers from ymm2, and ymm3/m256 and store in ymm1 using writemask k1.
EVEX.NDS.256.66.0F.WIG FD /r | VPADDW ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Add packed word integers from ymm2, and ymm3/m256 and store in ymm1 using writemask k1.
EVEX.NDS.256.66.0F.W0 FE /r | VPADDD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Add packed doubleword integers from ymm2, ymm3/m256/m32bcst and store in ymm1 using writemask k1.
EVEX.NDS.256.66.0F.W1 D4 /r | VPADDQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Add packed quadword integers from ymm2, ymm3/m256/m64bcst and store in ymm1 using writemask k1.
EVEX.NDS.512.66.0F.WIG FC /r | VPADDB zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Add packed byte integers from zmm2, and zmm3/m512 and store in zmm1 using writemask k1.
EVEX.NDS.512.66.0F.WIG FD /r | VPADDW zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Add packed word integers from zmm2, and zmm3/m512 and store in zmm1 using writemask k1.
EVEX.NDS.512.66.0F.W0 FE /r | VPADDD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | FV | V/V | AVX512F | Add packed doubleword integers from zmm2, zmm3/m512/m32bcst and store in zmm1 using writemask k1.
EVEX.NDS.512.66.0F.W1 D4 /r | VPADDQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | FV | V/V | AVX512F | Add packed quadword integers from zmm2, zmm3/m512/m64bcst and store in zmm1 using writemask k1.

NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers" in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Performs a SIMD add of the packed integers from the source operand (second operand) and the destination 
operand (first operand), and stores the packed integer results in the destination operand. See Figure 9-4 in the
Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1, for an illustration of a SIMD operation.
Overflow is handled with wraparound, as described in the following paragraphs. 

The PADDB and VPADDB instructions add packed byte integers from the first source operand and second source 
operand and store the packed integer results in the destination operand. When an individual result is too large to 
be represented in 8 bits (overflow), the result is wrapped around and the low 8 bits are written to the destination 
operand (that is, the carry is ignored). 
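
The wraparound behavior can be illustrated with a scalar C sketch for the byte case; the word, doubleword and quadword cases follow the same pattern with wider element types. The function name paddb_model is hypothetical, and the operands are assumed to be held as byte arrays.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a packed byte add with wraparound: each sum is
   truncated to its low 8 bits and the carry out of bit 7 is dropped.
   No carry propagates into the neighboring element. */
static void paddb_model(uint8_t *dest, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dest[i] = (uint8_t)(dest[i] + src[i]);
}
```

For instance, FAH + 0AH produces 04H in that element rather than 104H, and the adjacent elements are unaffected.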

The PADDW and VPADDW instructions add packed word integers from the first source operand and second source 
operand and store the packed integer results in the destination operand. When an individual result is too large to 
be represented in 16 bits (overflow), the result is wrapped around and the low 16 bits are written to the destination 
operand (that is, the carry is ignored). 

The PADDD and VPADDD instructions add packed doubleword integers from the first source operand and second 
source operand and store the packed integer results in the destination operand. When an individual result is too 
large to be represented in 32 bits (overflow), the result is wrapped around and the low 32 bits are written to the 
destination operand (that is, the carry is ignored). 

The PADDQ and VPADDQ instructions add packed quadword integers from the first source operand and second 
source operand and store the packed integer results in the destination operand. When a quadword result is too 
large to be represented in 64 bits (overflow), the result is wrapped around and the low 64 bits are written to the 
destination operand (that is, the carry is ignored). 

Note that the (V)PADDB, (V)PADDW, (V)PADDD and (V)PADDQ instructions can operate on either unsigned or 
signed (two's complement notation) packed integers; however, they do not set bits in the EFLAGS register to indicate 
overflow and/or a carry. To prevent undetected overflow conditions, software must control the ranges of 
values operated on. 
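
The wraparound behavior described above can be modeled in portable C. The following is a minimal sketch rather than the instruction itself: `paddb_model` and `paddb_would_wrap` are hypothetical names introduced here for illustration, and the range check stands in for the kind of test software might perform because no EFLAGS bit reports the overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of PADDB lane arithmetic: every byte lane is added
   independently and only the low 8 bits of each sum are kept, so the
   carry out of a lane is discarded. */
static void paddb_model(uint8_t *dest, const uint8_t *src, size_t lanes)
{
    assert(dest != NULL && src != NULL);
    for (size_t i = 0; i < lanes; i++) {
        dest[i] = (uint8_t)(dest[i] + src[i]);
    }
}

/* Hypothetical helper: reports whether an unsigned byte add would wrap.
   The instruction sets no flag, so a check like this has to be done
   by software if it matters. */
static int paddb_would_wrap(uint8_t a, uint8_t b)
{
    return ((unsigned)a + (unsigned)b) > 0xFFu;
}
```

The same pattern extends to word, doubleword and quadword lanes by changing the lane type; only the width at which the sum is truncated differs.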

EVEX encoded VPADDD/Q: The first source operand is a ZMM/YMM/XMM register. The second source operand is a 
ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 
32/64-bit memory location. The destination operand is a ZMM/YMM/XMM register updated according to the 
writemask. 

EVEX encoded VPADDB/W: The first source operand is a ZMM/YMM/XMM register. The second source operand is a 
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM 
register updated according to the writemask. 

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register 
or a 256-bit memory location. The destination operand is a YMM register, the upper bits (MAX_VL-1:256) of the 
destination are cleared. 

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM 
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128) 
of the corresponding ZMM register destination are zeroed. 

128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM 
register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the 
upper bits (MAX_VL-1:128) of the corresponding ZMM register destination are unmodified. 

Operation 

PADDB (with 64-bit operands) 
    DEST[7:0] ← DEST[7:0] + SRC[7:0]; 
    (* Repeat add operation for 2nd through 7th byte *) 
    DEST[63:56] ← DEST[63:56] + SRC[63:56]; 

PADDW (with 64-bit operands) 
    DEST[15:0] ← DEST[15:0] + SRC[15:0]; 
    (* Repeat add operation for 2nd and 3rd word *) 
    DEST[63:48] ← DEST[63:48] + SRC[63:48]; 

PADDD (with 64-bit operands) 
    DEST[31:0] ← DEST[31:0] + SRC[31:0]; 
    DEST[63:32] ← DEST[63:32] + SRC[63:32]; 

PADDQ (with 64-bit operands) 
    DEST[63:0] ← DEST[63:0] + SRC[63:0]; 

PADDB (Legacy SSE instruction) 
    DEST[7:0] ← DEST[7:0] + SRC[7:0]; 
    (* Repeat add operation for 2nd through 15th byte *) 
    DEST[127:120] ← DEST[127:120] + SRC[127:120]; 
    DEST[MAX_VL-1:128] (Unmodified) 

PADDW (Legacy SSE instruction) 
    DEST[15:0] ← DEST[15:0] + SRC[15:0]; 
    (* Repeat add operation for 2nd through 7th word *) 
    DEST[127:112] ← DEST[127:112] + SRC[127:112]; 
    DEST[MAX_VL-1:128] (Unmodified) 

PADDD (Legacy SSE instruction) 
    DEST[31:0] ← DEST[31:0] + SRC[31:0]; 
    (* Repeat add operation for 2nd and 3rd doubleword *) 
    DEST[127:96] ← DEST[127:96] + SRC[127:96]; 
    DEST[MAX_VL-1:128] (Unmodified) 

PADDQ (Legacy SSE instruction) 
    DEST[63:0] ← DEST[63:0] + SRC[63:0]; 
    DEST[127:64] ← DEST[127:64] + SRC[127:64]; 
    DEST[MAX_VL-1:128] (Unmodified) 

VPADDB (VEX.128 encoded instruction) 
    DEST[7:0] ← SRC1[7:0] + SRC2[7:0]; 
    (* Repeat add operation for 2nd through 15th byte *) 
    DEST[127:120] ← SRC1[127:120] + SRC2[127:120]; 
    DEST[MAX_VL-1:128] ← 0; 

VPADDW (VEX.128 encoded instruction) 
    DEST[15:0] ← SRC1[15:0] + SRC2[15:0]; 
    (* Repeat add operation for 2nd through 7th word *) 
    DEST[127:112] ← SRC1[127:112] + SRC2[127:112]; 
    DEST[MAX_VL-1:128] ← 0; 

VPADDD (VEX.128 encoded instruction) 
    DEST[31:0] ← SRC1[31:0] + SRC2[31:0]; 
    (* Repeat add operation for 2nd and 3rd doubleword *) 
    DEST[127:96] ← SRC1[127:96] + SRC2[127:96]; 
    DEST[MAX_VL-1:128] ← 0; 

VPADDQ (VEX.128 encoded instruction) 
    DEST[63:0] ← SRC1[63:0] + SRC2[63:0]; 
    DEST[127:64] ← SRC1[127:64] + SRC2[127:64]; 
    DEST[MAX_VL-1:128] ← 0; 

VPADDB (VEX.256 encoded instruction) 
    DEST[7:0] ← SRC1[7:0] + SRC2[7:0]; 
    (* Repeat add operation for 2nd through 31st byte *) 
    DEST[255:248] ← SRC1[255:248] + SRC2[255:248]; 

VPADDW (VEX.256 encoded instruction) 
    DEST[15:0] ← SRC1[15:0] + SRC2[15:0]; 
    (* Repeat add operation for 2nd through 15th word *) 
    DEST[255:240] ← SRC1[255:240] + SRC2[255:240]; 

VPADDD (VEX.256 encoded instruction) 
    DEST[31:0] ← SRC1[31:0] + SRC2[31:0]; 
    (* Repeat add operation for 2nd through 7th doubleword *) 
    DEST[255:224] ← SRC1[255:224] + SRC2[255:224]; 

VPADDQ (VEX.256 encoded instruction) 
    DEST[63:0] ← SRC1[63:0] + SRC2[63:0]; 
    DEST[127:64] ← SRC1[127:64] + SRC2[127:64]; 
    DEST[191:128] ← SRC1[191:128] + SRC2[191:128]; 
    DEST[255:192] ← SRC1[255:192] + SRC2[255:192]; 


VPADDB (EVEX encoded versions) 
(KL, VL) = (16, 128), (32, 256), (64, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 8 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+7:i] ← SRC1[i+7:i] + SRC2[i+7:i] 
        ELSE 
            IF *merging-masking*            ; merging-masking 
                THEN *DEST[i+7:i] remains unchanged* 
                ELSE *zeroing-masking*      ; zeroing-masking 
                    DEST[i+7:i] = 0 
            FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 

VPADDW (EVEX encoded versions) 
(KL, VL) = (8, 128), (16, 256), (32, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 16 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+15:i] ← SRC1[i+15:i] + SRC2[i+15:i] 
        ELSE 
            IF *merging-masking*            ; merging-masking 
                THEN *DEST[i+15:i] remains unchanged* 
                ELSE *zeroing-masking*      ; zeroing-masking 
                    DEST[i+15:i] = 0 
            FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 

VPADDD (EVEX encoded versions) 
(KL, VL) = (4, 128), (8, 256), (16, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 32 
    IF k1[j] OR *no writemask* 
        THEN 
            IF (EVEX.b = 1) AND (SRC2 *is memory*) 
                THEN DEST[i+31:i] ← SRC1[i+31:i] + SRC2[31:0] 
                ELSE DEST[i+31:i] ← SRC1[i+31:i] + SRC2[i+31:i] 
            FI; 
        ELSE 
            IF *merging-masking*            ; merging-masking 
                THEN *DEST[i+31:i] remains unchanged* 
                ELSE *zeroing-masking*      ; zeroing-masking 
                    DEST[i+31:i] ← 0 
            FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 

VPADDQ (EVEX encoded versions) 
(KL, VL) = (2, 128), (4, 256), (8, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 64 
    IF k1[j] OR *no writemask* 
        THEN 
            IF (EVEX.b = 1) AND (SRC2 *is memory*) 
                THEN DEST[i+63:i] ← SRC1[i+63:i] + SRC2[63:0] 
                ELSE DEST[i+63:i] ← SRC1[i+63:i] + SRC2[i+63:i] 
            FI; 
        ELSE 
            IF *merging-masking*            ; merging-masking 
                THEN *DEST[i+63:i] remains unchanged* 
                ELSE *zeroing-masking*      ; zeroing-masking 
                    DEST[i+63:i] ← 0 
            FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 
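
The EVEX pseudocode above combines three ideas: per-lane writemasking, the choice between merging and zeroing for masked-off lanes, and the optional broadcast of a single memory element. A minimal C sketch of the doubleword case follows. The function name `vpaddd_evex_model` and its parameters are assumptions made for illustration; the "no writemask" case is modeled by passing an all-ones mask, and the clearing of DEST[MAX_VL-1:VL] is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of EVEX-encoded VPADDD over `kl` doubleword lanes.
   k1      : writemask, bit j controls lane j
   zeroing : nonzero selects zeroing-masking, zero selects merging-masking
   bcast   : nonzero models EVEX.b = 1 with a memory source, in which
             case src2[0] is used for every lane */
static void vpaddd_evex_model(uint32_t *dest, const uint32_t *src1,
                              const uint32_t *src2, unsigned kl,
                              uint16_t k1, int zeroing, int bcast)
{
    assert(kl == 4 || kl == 8 || kl == 16);
    for (unsigned j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u) {
            uint32_t b = bcast ? src2[0] : src2[j];
            dest[j] = src1[j] + b;   /* wraps modulo 2^32 */
        } else if (zeroing) {
            dest[j] = 0;             /* zeroing-masking */
        }                            /* merging-masking: lane is left as-is */
    }
}
```

With merging-masking the previous contents of the destination act as a third input, which is why the masked intrinsics take an extra `s` argument while the zeroing (`maskz`) forms do not.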

Intel C/C++ Compiler Intrinsic Equivalents 

VPADDB __m512i _mm512_add_epi8 (__m512i a, __m512i b) 
VPADDW __m512i _mm512_add_epi16 (__m512i a, __m512i b) 
VPADDB __m512i _mm512_mask_add_epi8 (__m512i s, __mmask64 m, __m512i a, __m512i b) 
VPADDW __m512i _mm512_mask_add_epi16 (__m512i s, __mmask32 m, __m512i a, __m512i b) 
VPADDB __m512i _mm512_maskz_add_epi8 (__mmask64 m, __m512i a, __m512i b) 
VPADDW __m512i _mm512_maskz_add_epi16 (__mmask32 m, __m512i a, __m512i b) 
VPADDB __m256i _mm256_mask_add_epi8 (__m256i s, __mmask32 m, __m256i a, __m256i b) 
VPADDW __m256i _mm256_mask_add_epi16 (__m256i s, __mmask16 m, __m256i a, __m256i b) 
VPADDB __m256i _mm256_maskz_add_epi8 (__mmask32 m, __m256i a, __m256i b) 
VPADDW __m256i _mm256_maskz_add_epi16 (__mmask16 m, __m256i a, __m256i b) 
VPADDB __m128i _mm_mask_add_epi8 (__m128i s, __mmask16 m, __m128i a, __m128i b) 
VPADDW __m128i _mm_mask_add_epi16 (__m128i s, __mmask8 m, __m128i a, __m128i b) 
VPADDB __m128i _mm_maskz_add_epi8 (__mmask16 m, __m128i a, __m128i b) 
VPADDW __m128i _mm_maskz_add_epi16 (__mmask8 m, __m128i a, __m128i b) 
VPADDD __m512i _mm512_add_epi32 (__m512i a, __m512i b); 
VPADDD __m512i _mm512_mask_add_epi32 (__m512i s, __mmask16 k, __m512i a, __m512i b); 
VPADDD __m512i _mm512_maskz_add_epi32 (__mmask16 k, __m512i a, __m512i b); 
VPADDD __m256i _mm256_mask_add_epi32 (__m256i s, __mmask8 k, __m256i a, __m256i b); 
VPADDD __m256i _mm256_maskz_add_epi32 (__mmask8 k, __m256i a, __m256i b); 
VPADDD __m128i _mm_mask_add_epi32 (__m128i s, __mmask8 k, __m128i a, __m128i b); 
VPADDD __m128i _mm_maskz_add_epi32 (__mmask8 k, __m128i a, __m128i b); 
VPADDQ __m512i _mm512_add_epi64 (__m512i a, __m512i b); 
VPADDQ __m512i _mm512_mask_add_epi64 (__m512i s, __mmask8 k, __m512i a, __m512i b); 
VPADDQ __m512i _mm512_maskz_add_epi64 (__mmask8 k, __m512i a, __m512i b); 
VPADDQ __m256i _mm256_mask_add_epi64 (__m256i s, __mmask8 k, __m256i a, __m256i b); 
VPADDQ __m256i _mm256_maskz_add_epi64 (__mmask8 k, __m256i a, __m256i b); 
VPADDQ __m128i _mm_mask_add_epi64 (__m128i s, __mmask8 k, __m128i a, __m128i b); 
VPADDQ __m128i _mm_maskz_add_epi64 (__mmask8 k, __m128i a, __m128i b); 
PADDB __m128i _mm_add_epi8 (__m128i a, __m128i b); 
PADDW __m128i _mm_add_epi16 (__m128i a, __m128i b); 
PADDD __m128i _mm_add_epi32 (__m128i a, __m128i b); 
PADDQ __m128i _mm_add_epi64 (__m128i a, __m128i b); 
VPADDB __m256i _mm256_add_epi8 (__m256i a, __m256i b); 
VPADDW __m256i _mm256_add_epi16 (__m256i a, __m256i b); 
VPADDD __m256i _mm256_add_epi32 (__m256i a, __m256i b); 
VPADDQ __m256i _mm256_add_epi64 (__m256i a, __m256i b); 
PADDB __m64 _mm_add_pi8 (__m64 m1, __m64 m2) 
PADDW __m64 _mm_add_pi16 (__m64 m1, __m64 m2) 
PADDD __m64 _mm_add_pi32 (__m64 m1, __m64 m2) 
PADDQ __m64 _mm_add_pi64 (__m64 m1, __m64 m2) 

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 
EVEX-encoded VPADDD/Q, see Exceptions Type E4. 
EVEX-encoded VPADDB/W, see Exceptions Type E4.nb. 




PADDSB/PADDSW—Add Packed Signed Integers with Signed Saturation 


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description 
0F EC /r¹ PADDSB mm, mm/m64 | RM | V/V | MMX | Add packed signed byte integers from mm/m64 and mm and saturate the results. 
66 0F EC /r PADDSB xmm1, xmm2/m128 | RM | V/V | SSE2 | Add packed signed byte integers from xmm2/m128 and xmm1 and saturate the results. 
0F ED /r¹ PADDSW mm, mm/m64 | RM | V/V | MMX | Add packed signed word integers from mm/m64 and mm and saturate the results. 
66 0F ED /r PADDSW xmm1, xmm2/m128 | RM | V/V | SSE2 | Add packed signed word integers from xmm2/m128 and xmm1 and saturate the results. 
VEX.NDS.128.66.0F.WIG EC /r VPADDSB xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Add packed signed byte integers from xmm3/m128 and xmm2 and saturate the results. 
VEX.NDS.128.66.0F.WIG ED /r VPADDSW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Add packed signed word integers from xmm3/m128 and xmm2 and saturate the results. 
VEX.NDS.256.66.0F.WIG EC /r VPADDSB ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Add packed signed byte integers from ymm2 and ymm3/m256 and store the saturated results in ymm1. 
VEX.NDS.256.66.0F.WIG ED /r VPADDSW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Add packed signed word integers from ymm2 and ymm3/m256 and store the saturated results in ymm1. 
EVEX.NDS.128.66.0F.WIG EC /r VPADDSB xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Add packed signed byte integers from xmm2 and xmm3/m128 and store the saturated results in xmm1 under writemask k1. 
EVEX.NDS.256.66.0F.WIG EC /r VPADDSB ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Add packed signed byte integers from ymm2 and ymm3/m256 and store the saturated results in ymm1 under writemask k1. 
EVEX.NDS.512.66.0F.WIG EC /r VPADDSB zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Add packed signed byte integers from zmm2 and zmm3/m512 and store the saturated results in zmm1 under writemask k1. 
EVEX.NDS.128.66.0F.WIG ED /r VPADDSW xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Add packed signed word integers from xmm2 and xmm3/m128 and store the saturated results in xmm1 under writemask k1. 
EVEX.NDS.256.66.0F.WIG ED /r VPADDSW ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Add packed signed word integers from ymm2 and ymm3/m256 and store the saturated results in ymm1 under writemask k1. 
EVEX.NDS.512.66.0F.WIG ED /r VPADDSW zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Add packed signed word integers from zmm2 and zmm3/m512 and store the saturated results in zmm1 under writemask k1. 


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers" 
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. 

Instruction Operand Encoding 

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA 
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA 
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA 


Description 

Performs a SIMD add of the packed signed integers from the source operand (second operand) and the destination 
operand (first operand), and stores the packed integer results in the destination operand. See Figure 9-4 in the 
Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1, for an illustration of a SIMD operation. 
Overflow is handled with signed saturation, as described in the following paragraphs. 

(V)PADDSB performs a SIMD add of the packed signed integers with saturation from the first source operand and 
second source operand and stores the packed integer results in the destination operand. When an individual byte 
result is beyond the range of a signed byte integer (that is, greater than 7FH or less than 80H), the saturated value 
of 7FH or 80H, respectively, is written to the destination operand. 

(V)PADDSW performs a SIMD add of the packed signed word integers with saturation from the first source operand 
and second source operand and stores the packed integer results in the destination operand. When an individual 
word result is beyond the range of a signed word integer (that is, greater than 7FFFH or less than 8000H), the saturated 
value of 7FFFH or 8000H, respectively, is written to the destination operand. 
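
Signed saturation can be illustrated with a short scalar model in C. This is a sketch under stated assumptions: the helper names mirror the SaturateToSignedByte and SaturateToSignedWord operations used in this section's pseudocode, but the functions themselves, including `paddsb_model`, are hypothetical and written here only to show the clamping rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a wide intermediate sum to the signed byte range 80H..7FH. */
static int8_t saturate_to_signed_byte(int32_t sum)
{
    if (sum > INT8_MAX) return INT8_MAX;   /* 7FH */
    if (sum < INT8_MIN) return INT8_MIN;   /* 80H */
    return (int8_t)sum;
}

/* Clamp a wide intermediate sum to the signed word range 8000H..7FFFH. */
static int16_t saturate_to_signed_word(int32_t sum)
{
    if (sum > INT16_MAX) return INT16_MAX; /* 7FFFH */
    if (sum < INT16_MIN) return INT16_MIN; /* 8000H */
    return (int16_t)sum;
}

/* Scalar model of PADDSB: each lane is added in a wider type so the
   true sum is available, then clamped before being written back. */
static void paddsb_model(int8_t *dest, const int8_t *src, size_t lanes)
{
    assert(dest != NULL && src != NULL);
    for (size_t i = 0; i < lanes; i++) {
        dest[i] = saturate_to_signed_byte((int32_t)dest[i] + (int32_t)src[i]);
    }
}
```

Computing the sum in a wider type is what distinguishes this from the wraparound forms: the out-of-range value is observed before it is narrowed.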

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand is a 
ZMM/YMM/XMM register or a memory location. The destination operand is a ZMM/YMM/XMM register. 

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register 
or a 256-bit memory location. The destination operand is a YMM register. 

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM 
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128) of 
the corresponding register destination are zeroed. 

128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM 
register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the 
upper bits (MAX_VL-1:128) of the corresponding register destination are unmodified. 

Operation 

PADDSB (with 64-bit operands) 
    DEST[7:0] ← SaturateToSignedByte(DEST[7:0] + SRC[7:0]); 
    (* Repeat add operation for 2nd through 7th bytes *) 
    DEST[63:56] ← SaturateToSignedByte(DEST[63:56] + SRC[63:56]); 

PADDSB (with 128-bit operands) 
    DEST[7:0] ← SaturateToSignedByte(DEST[7:0] + SRC[7:0]); 
    (* Repeat add operation for 2nd through 15th bytes *) 
    DEST[127:120] ← SaturateToSignedByte(DEST[127:120] + SRC[127:120]); 

VPADDSB (VEX.128 encoded version) 
    DEST[7:0] ← SaturateToSignedByte(SRC1[7:0] + SRC2[7:0]); 
    (* Repeat add operation for 2nd through 15th bytes *) 
    DEST[127:120] ← SaturateToSignedByte(SRC1[127:120] + SRC2[127:120]); 
    DEST[VLMAX-1:128] ← 0 

VPADDSB (VEX.256 encoded version) 
    DEST[7:0] ← SaturateToSignedByte(SRC1[7:0] + SRC2[7:0]); 
    (* Repeat add operation for 2nd through 31st bytes *) 
    DEST[255:248] ← SaturateToSignedByte(SRC1[255:248] + SRC2[255:248]); 

VPADDSB (EVEX encoded versions) 
(KL, VL) = (16, 128), (32, 256), (64, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 8 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+7:i] ← SaturateToSignedByte(SRC1[i+7:i] + SRC2[i+7:i]) 
        ELSE 
            IF *merging-masking*            ; merging-masking 
                THEN *DEST[i+7:i] remains unchanged* 
                ELSE *zeroing-masking*      ; zeroing-masking 
                    DEST[i+7:i] = 0 
            FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 

PADDSW (with 64-bit operands) 
    DEST[15:0] ← SaturateToSignedWord(DEST[15:0] + SRC[15:0]); 
    (* Repeat add operation for 2nd and 3rd words *) 
    DEST[63:48] ← SaturateToSignedWord(DEST[63:48] + SRC[63:48]); 

PADDSW (with 128-bit operands) 
    DEST[15:0] ← SaturateToSignedWord(DEST[15:0] + SRC[15:0]); 
    (* Repeat add operation for 2nd through 7th words *) 
    DEST[127:112] ← SaturateToSignedWord(DEST[127:112] + SRC[127:112]); 

VPADDSW (VEX.128 encoded version) 
    DEST[15:0] ← SaturateToSignedWord(SRC1[15:0] + SRC2[15:0]); 
    (* Repeat add operation for 2nd through 7th words *) 
    DEST[127:112] ← SaturateToSignedWord(SRC1[127:112] + SRC2[127:112]); 
    DEST[VLMAX-1:128] ← 0 

VPADDSW (VEX.256 encoded version) 
    DEST[15:0] ← SaturateToSignedWord(SRC1[15:0] + SRC2[15:0]); 
    (* Repeat add operation for 2nd through 15th words *) 
    DEST[255:240] ← SaturateToSignedWord(SRC1[255:240] + SRC2[255:240]) 

VPADDSW (EVEX encoded versions) 
(KL, VL) = (8, 128), (16, 256), (32, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 16 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+15:i] ← SaturateToSignedWord(SRC1[i+15:i] + SRC2[i+15:i]) 
        ELSE 
            IF *merging-masking*            ; merging-masking 
                THEN *DEST[i+15:i] remains unchanged* 
                ELSE *zeroing-masking*      ; zeroing-masking 
                    DEST[i+15:i] = 0 
            FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 

Intel C/C++ Compiler Intrinsic Equivalents 

PADDSB: __m64 _mm_adds_pi8 (__m64 m1, __m64 m2) 
(V)PADDSB: __m128i _mm_adds_epi8 (__m128i a, __m128i b) 
VPADDSB: __m256i _mm256_adds_epi8 (__m256i a, __m256i b) 
PADDSW: __m64 _mm_adds_pi16 (__m64 m1, __m64 m2) 
(V)PADDSW: __m128i _mm_adds_epi16 (__m128i a, __m128i b) 
VPADDSW: __m256i _mm256_adds_epi16 (__m256i a, __m256i b) 
VPADDSB __m512i _mm512_adds_epi8 (__m512i a, __m512i b) 
VPADDSW __m512i _mm512_adds_epi16 (__m512i a, __m512i b) 
VPADDSB __m512i _mm512_mask_adds_epi8 (__m512i s, __mmask64 m, __m512i a, __m512i b) 
VPADDSW __m512i _mm512_mask_adds_epi16 (__m512i s, __mmask32 m, __m512i a, __m512i b) 
VPADDSB __m512i _mm512_maskz_adds_epi8 (__mmask64 m, __m512i a, __m512i b) 
VPADDSW __m512i _mm512_maskz_adds_epi16 (__mmask32 m, __m512i a, __m512i b) 
VPADDSB __m256i _mm256_mask_adds_epi8 (__m256i s, __mmask32 m, __m256i a, __m256i b) 
VPADDSW __m256i _mm256_mask_adds_epi16 (__m256i s, __mmask16 m, __m256i a, __m256i b) 
VPADDSB __m256i _mm256_maskz_adds_epi8 (__mmask32 m, __m256i a, __m256i b) 
VPADDSW __m256i _mm256_maskz_adds_epi16 (__mmask16 m, __m256i a, __m256i b) 
VPADDSB __m128i _mm_mask_adds_epi8 (__m128i s, __mmask16 m, __m128i a, __m128i b) 
VPADDSW __m128i _mm_mask_adds_epi16 (__m128i s, __mmask8 m, __m128i a, __m128i b) 
VPADDSB __m128i _mm_maskz_adds_epi8 (__mmask16 m, __m128i a, __m128i b) 
VPADDSW __m128i _mm_maskz_adds_epi16 (__mmask8 m, __m128i a, __m128i b) 


Flags Affected 

None. 


SIMD Floating-Point Exceptions 

None. 


Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 
EVEX-encoded instruction, see Exceptions Type E4.nb. 

PADDUSB/PADDUSW—Add Packed Unsigned Integers with Unsigned Saturation 


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description 
0F DC /r¹ PADDUSB mm, mm/m64 | RM | V/V | MMX | Add packed unsigned byte integers from mm/m64 and mm and saturate the results. 
66 0F DC /r PADDUSB xmm1, xmm2/m128 | RM | V/V | SSE2 | Add packed unsigned byte integers from xmm2/m128 and xmm1 and saturate the results. 
0F DD /r¹ PADDUSW mm, mm/m64 | RM | V/V | MMX | Add packed unsigned word integers from mm/m64 and mm and saturate the results. 
66 0F DD /r PADDUSW xmm1, xmm2/m128 | RM | V/V | SSE2 | Add packed unsigned word integers from xmm2/m128 to xmm1 and saturate the results. 
VEX.NDS.128.66.0F.WIG DC /r VPADDUSB xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Add packed unsigned byte integers from xmm3/m128 to xmm2 and saturate the results. 
VEX.NDS.128.66.0F.WIG DD /r VPADDUSW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Add packed unsigned word integers from xmm3/m128 to xmm2 and saturate the results. 
VEX.NDS.256.66.0F.WIG DC /r VPADDUSB ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Add packed unsigned byte integers from ymm2 and ymm3/m256 and store the saturated results in ymm1. 
VEX.NDS.256.66.0F.WIG DD /r VPADDUSW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Add packed unsigned word integers from ymm2 and ymm3/m256 and store the saturated results in ymm1. 
EVEX.NDS.128.66.0F.WIG DC /r VPADDUSB xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Add packed unsigned byte integers from xmm2 and xmm3/m128 and store the saturated results in xmm1 under writemask k1. 
EVEX.NDS.256.66.0F.WIG DC /r VPADDUSB ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Add packed unsigned byte integers from ymm2 and ymm3/m256 and store the saturated results in ymm1 under writemask k1. 
EVEX.NDS.512.66.0F.WIG DC /r VPADDUSB zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Add packed unsigned byte integers from zmm2 and zmm3/m512 and store the saturated results in zmm1 under writemask k1. 
EVEX.NDS.128.66.0F.WIG DD /r VPADDUSW xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Add packed unsigned word integers from xmm2 and xmm3/m128 and store the saturated results in xmm1 under writemask k1. 
EVEX.NDS.256.66.0F.WIG DD /r VPADDUSW ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Add packed unsigned word integers from ymm2 and ymm3/m256 and store the saturated results in ymm1 under writemask k1. 
EVEX.NDS.512.66.0F.WIG DD /r VPADDUSW zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Add packed unsigned word integers from zmm2 and zmm3/m512 and store the saturated results in zmm1 under writemask k1. 


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers" 
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. 


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA 
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA 
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA 


Description 

Performs a SIMD add of the packed unsigned integers from the source operand (second operand) and the destination 
operand (first operand), and stores the packed integer results in the destination operand. See Figure 9-4 in the 
Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1, for an illustration of a SIMD operation. 
Overflow is handled with unsigned saturation, as described in the following paragraphs. 

(V)PADDUSB performs a SIMD add of the packed unsigned integers with saturation from the first source operand 
and second source operand and stores the packed integer results in the destination operand. When an individual 
byte result is beyond the range of an unsigned byte integer (that is, greater than FFH), the saturated value of FFH 
is written to the destination operand. 

(V)PADDUSW performs a SIMD add of the packed unsigned word integers with saturation from the first source 
operand and second source operand and stores the packed integer results in the destination operand. When an 
individual word result is beyond the range of an unsigned word integer (that is, greater than FFFFH), the saturated 
value of FFFFH is written to the destination operand. 
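
The unsigned case needs only an upper clamp, since the sum of two unsigned values cannot fall below zero. The sketch below models that rule in C; the helper names echo the SaturateToUnsignedByte and SaturateToUnsignedWord operations in this section's pseudocode, but the functions, including `paddusw_model`, are hypothetical illustrations rather than anything defined by the manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a wide intermediate sum to the unsigned byte maximum FFH. */
static uint8_t saturate_to_unsigned_byte(uint32_t sum)
{
    return (sum > 0xFFu) ? (uint8_t)0xFFu : (uint8_t)sum;
}

/* Clamp a wide intermediate sum to the unsigned word maximum FFFFH. */
static uint16_t saturate_to_unsigned_word(uint32_t sum)
{
    return (sum > 0xFFFFu) ? (uint16_t)0xFFFFu : (uint16_t)sum;
}

/* Scalar model of PADDUSW: add each word lane in 32 bits, then clamp. */
static void paddusw_model(uint16_t *dest, const uint16_t *src, size_t lanes)
{
    assert(dest != NULL && src != NULL);
    for (size_t i = 0; i < lanes; i++) {
        dest[i] = saturate_to_unsigned_word((uint32_t)dest[i] + (uint32_t)src[i]);
    }
}
```

A common use of this behavior is accumulating pixel or audio sample values, where clipping at the maximum is preferable to wrapping back to a small number.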

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand is a 
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination is a ZMM/YMM/XMM register. 

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register 
or a 256-bit memory location. The destination operand is a YMM register. 

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM 
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128) 
of the corresponding destination register are zeroed. 

128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM 
register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the 
upper bits (MAX_VL-1:128) of the corresponding register destination are unmodified. 

Operation 

PADDUSB (with 64-bit operands) 
    DEST[7:0] ← SaturateToUnsignedByte(DEST[7:0] + SRC[7:0]); 
    (* Repeat add operation for 2nd through 7th bytes *) 
    DEST[63:56] ← SaturateToUnsignedByte(DEST[63:56] + SRC[63:56]); 

PADDUSB (with 128-bit operands) 
    DEST[7:0] ← SaturateToUnsignedByte(DEST[7:0] + SRC[7:0]); 
    (* Repeat add operation for 2nd through 15th bytes *) 
    DEST[127:120] ← SaturateToUnsignedByte(DEST[127:120] + SRC[127:120]); 

VPADDUSB (VEX.128 encoded version) 
    DEST[7:0] ← SaturateToUnsignedByte(SRC1[7:0] + SRC2[7:0]); 
    (* Repeat add operation for 2nd through 15th bytes *) 
    DEST[127:120] ← SaturateToUnsignedByte(SRC1[127:120] + SRC2[127:120]); 
    DEST[VLMAX-1:128] ← 0 

VPADDUSB (VEX.256 encoded version) 
    DEST[7:0] ← SaturateToUnsignedByte(SRC1[7:0] + SRC2[7:0]); 
    (* Repeat add operation for 2nd through 31st bytes *) 
    DEST[255:248] ← SaturateToUnsignedByte(SRC1[255:248] + SRC2[255:248]); 

PADDUSW (with 64-bit operands) 
    DEST[15:0] ← SaturateToUnsignedWord(DEST[15:0] + SRC[15:0]); 
    (* Repeat add operation for 2nd and 3rd words *) 
    DEST[63:48] ← SaturateToUnsignedWord(DEST[63:48] + SRC[63:48]); 

PADDUSW (with 128-bit operands) 
    DEST[15:0] ← SaturateToUnsignedWord(DEST[15:0] + SRC[15:0]); 
    (* Repeat add operation for 2nd through 7th words *) 
    DEST[127:112] ← SaturateToUnsignedWord(DEST[127:112] + SRC[127:112]); 

VPADDUSW (VEX.128 encoded version) 
    DEST[15:0] ← SaturateToUnsignedWord(SRC1[15:0] + SRC2[15:0]); 
    (* Repeat add operation for 2nd through 7th words *) 
    DEST[127:112] ← SaturateToUnsignedWord(SRC1[127:112] + SRC2[127:112]); 
    DEST[VLMAX-1:128] ← 0 

VPADDUSW (VEX.256 encoded version) 
    DEST[15:0] ← SaturateToUnsignedWord(SRC1[15:0] + SRC2[15:0]); 
    (* Repeat add operation for 2nd through 15th words *) 
    DEST[255:240] ← SaturateToUnsignedWord(SRC1[255:240] + SRC2[255:240]) 

VPADDUSB (EVEX encoded versions) 
(KL, VL) = (16, 128), (32, 256), (64, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 8 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+7:i] ← SaturateToUnsignedByte(SRC1[i+7:i] + SRC2[i+7:i]) 
        ELSE 
            IF *merging-masking*            ; merging-masking 
                THEN *DEST[i+7:i] remains unchanged* 
                ELSE *zeroing-masking*      ; zeroing-masking 
                    DEST[i+7:i] = 0 
            FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 

VPADDUSW (EVEX encoded versions) 
(KL, VL) = (8, 128), (16, 256), (32, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 16 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+15:i] ← SaturateToUnsignedWord(SRC1[i+15:i] + SRC2[i+15:i]) 
        ELSE 
            IF *merging-masking*            ; merging-masking 
                THEN *DEST[i+15:i] remains unchanged* 
                ELSE *zeroing-masking*      ; zeroing-masking 
                    DEST[i+15:i] = 0 
            FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 


Intel C/C++ Compiler Intrinsic Equivalents 

PADDUSB: __m64 _mm_adds_pu8 (__m64 m1, __m64 m2) 
PADDUSW: __m64 _mm_adds_pu16 (__m64 m1, __m64 m2) 
(V)PADDUSB: __m128i _mm_adds_epu8 (__m128i a, __m128i b) 
(V)PADDUSW: __m128i _mm_adds_epu16 (__m128i a, __m128i b) 
VPADDUSB: __m256i _mm256_adds_epu8 (__m256i a, __m256i b) 
VPADDUSW: __m256i _mm256_adds_epu16 (__m256i a, __m256i b) 
VPADDUSB __m512i _mm512_adds_epu8 (__m512i a, __m512i b) 
VPADDUSW __m512i _mm512_adds_epu16 (__m512i a, __m512i b) 
VPADDUSB __m512i _mm512_mask_adds_epu8 (__m512i s, __mmask64 m, __m512i a, __m512i b) 
VPADDUSW __m512i _mm512_mask_adds_epu16 (__m512i s, __mmask32 m, __m512i a, __m512i b) 
VPADDUSB __m512i _mm512_maskz_adds_epu8 (__mmask64 m, __m512i a, __m512i b) 
VPADDUSW __m512i _mm512_maskz_adds_epu16 (__mmask32 m, __m512i a, __m512i b) 
VPADDUSB __m256i _mm256_mask_adds_epu8 (__m256i s, __mmask32 m, __m256i a, __m256i b) 
VPADDUSW __m256i _mm256_mask_adds_epu16 (__m256i s, __mmask16 m, __m256i a, __m256i b) 
VPADDUSB __m256i _mm256_maskz_adds_epu8 (__mmask32 m, __m256i a, __m256i b) 
VPADDUSW __m256i _mm256_maskz_adds_epu16 (__mmask16 m, __m256i a, __m256i b) 
VPADDUSB __m128i _mm_mask_adds_epu8 (__m128i s, __mmask16 m, __m128i a, __m128i b) 
VPADDUSW __m128i _mm_mask_adds_epu16 (__m128i s, __mmask8 m, __m128i a, __m128i b) 
VPADDUSB __m128i _mm_maskz_adds_epu8 (__mmask16 m, __m128i a, __m128i b) 
VPADDUSW __m128i _mm_maskz_adds_epu16 (__mmask8 m, __m128i a, __m128i b) 


Flags Affected 

None. 


Numeric Exceptions 

None. 


Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 
EVEX-encoded instruction, see Exceptions Type E4.nb. 

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description 
0F 3A 0F /r ib¹ PALIGNR mm1, mm2/m64, imm8 | RMI | V/V | SSSE3 | Concatenate destination and source operands, extract byte-aligned result shifted to the right by constant value in imm8 into mm1. 
66 0F 3A 0F /r ib PALIGNR xmm1, xmm2/m128, imm8 | RMI | V/V | SSSE3 | Concatenate destination and source operands, extract byte-aligned result shifted to the right by constant value in imm8 into xmm1. 
VEX.NDS.128.66.0F3A.WIG 0F /r ib VPALIGNR xmm1, xmm2, xmm3/m128, imm8 | RVMI | V/V | AVX | Concatenate xmm2 and xmm3/m128, extract byte aligned result shifted to the right by constant value in imm8 and result is stored in xmm1. 
VEX.NDS.256.66.0F3A.WIG 0F /r ib VPALIGNR ymm1, ymm2, ymm3/m256, imm8 | RVMI | V/V | AVX2 | Concatenate pairs of 16 bytes in ymm2 and ymm3/m256 into 32-byte intermediate result, extract byte-aligned, 16-byte result shifted to the right by constant values in imm8 from each intermediate result, and two 16-byte results are stored in ymm1. 
EVEX.NDS.128.66.0F3A.WIG 0F /r ib VPALIGNR xmm1 {k1}{z}, xmm2, xmm3/m128, imm8 | FVM | V/V | AVX512VL AVX512BW | Concatenate xmm2 and xmm3/m128 into a 32-byte intermediate result, extract byte aligned result shifted to the right by constant value in imm8 and result is stored in xmm1. 
EVEX.NDS.256.66.0F3A.WIG 0F /r ib VPALIGNR ymm1 {k1}{z}, ymm2, ymm3/m256, imm8 | FVM | V/V | AVX512VL AVX512BW | Concatenate pairs of 16 bytes in ymm2 and ymm3/m256 into 32-byte intermediate result, extract byte-aligned, 16-byte result shifted to the right by constant values in imm8 from each intermediate result, and two 16-byte results are stored in ymm1. 
EVEX.NDS.512.66.0F3A.WIG 0F /r ib VPALIGNR zmm1 {k1}{z}, zmm2, zmm3/m512, imm8 | FVM | V/V | AVX512BW | Concatenate pairs of 16 bytes in zmm2 and zmm3/m512 into 32-byte intermediate result, extract byte-aligned, 16-byte result shifted to the right by constant values in imm8 from each intermediate result, and four 16-byte results are stored in zmm1. 


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers" 
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. 


Instruction Operand Encoding 

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
RMI | ModRM:reg (r, w) | ModRM:r/m (r) | imm8 | NA 
RVMI | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | imm8 
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA 


Description 

(V)PALIGNR concatenates the destination operand (the first operand) and the source operand (the second 
operand) into an intermediate composite, shifts the composite at byte granularity to the right by a constant 
immediate, and extracts the right-aligned result into the destination. Both operands can be MMX registers, XMM 
registers or YMM registers. The immediate value is considered unsigned. Immediate shift counts larger than 2L, 
where L is the operand length in bytes (i.e., 32 for 128-bit operands, or 16 for 64-bit operands), produce a zero 
result. When the source operand is a 128-bit memory operand, the operand must 
be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated. 
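
The concatenate-shift-extract behavior described above can be sketched as a scalar model. This is only an illustration under stated assumptions: `palignr128_model` is a hypothetical helper name (not an Intel intrinsic), registers are represented as byte arrays with byte 0 as the least significant byte, and the destination doubles as the first source, as in the legacy SSE form.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of the 128-bit PALIGNR data flow.
 * dst is both the first source and the destination; byte 0 is the
 * least significant byte of each operand. */
static void palignr128_model(uint8_t dst[16], const uint8_t src[16], uint8_t imm8)
{
    uint8_t composite[32];
    uint8_t result[16];

    assert(dst != NULL && src != NULL);

    /* Low half of the composite comes from the source operand,
     * high half from the destination operand. */
    memcpy(composite, src, 16);
    memcpy(composite + 16, dst, 16);

    /* A right shift by imm8 bytes means result byte i is composite byte
     * i + imm8; positions beyond the composite contribute zero. */
    for (unsigned i = 0; i < 16; i++) {
        unsigned from = i + imm8;
        result[i] = (from < 32) ? composite[from] : 0;
    }
    memcpy(dst, result, 16);
}
```

With a shift count of 4, the low 12 result bytes come from the source and the top 4 from the destination; once the count reaches the composite width, every result byte is zero.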

In 64-bit mode and not encoded by VEX/EVEX prefix, use the REX prefix to access additional registers. 

128-bit Legacy SSE version: Bits (VLMAX-1:128) of the corresponding YMM destination register remain 
unchanged. 

EVEX.512 encoded version: The first source operand is a ZMM register and contains four 16-byte blocks. The 
second source operand is a ZMM register or a 512-bit memory location containing four 16-byte blocks. The 
destination operand is a ZMM register and contains four 16-byte results. The imm8[7:0] is the common shift count 
used for each of the four successive 16-byte block sources. The low 16-byte block of the two source operands 
produces the low 16-byte result of the destination operand, the high 16-byte block of the two source operands 
produces the high 16-byte result of the destination operand, and so on for the blocks in the middle. 

VEX.256 and EVEX.256 encoded versions: The first source operand is a YMM register and contains two 16-byte 
blocks. The second source operand is a YMM register or a 256-bit memory location containing two 16-byte blocks. 
The destination operand is a YMM register and contains two 16-byte results. The imm8[7:0] is the common shift 
count used for the two lower 16-byte block sources and the two upper 16-byte block sources. The low 16-byte 
block of the two source operands produces the low 16-byte result of the destination operand, and the high 16-byte 
block of the two source operands produces the high 16-byte result of the destination operand. The upper bits 
(MAX_VL-1:256) of the corresponding ZMM register destination are zeroed. 

VEX.128 and EVEX.128 encoded versions: The first source operand is an XMM register. The second source operand 
is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits 
(MAX_VL-1:128) of the corresponding ZMM register destination are zeroed. 

Concatenation is done with 128-bit data in the first and second source operand for both 128-bit and 256-bit 
instructions. The high 128-bits of the intermediate composite 256-bit result came from the 128-bit data from the 
first source operand; the low 128-bits of the intermediate result came from the 128-bit data of the second source 
operand. 

Note: VEX.L must be 0, otherwise the instruction will #UD. 


Figure 4-7. 256-bit VPALIGN Instruction Operation 


Operation 

PALIGNR (with 64-bit operands) 
temp1[127:0] = CONCATENATE(DEST,SRC)>>(imm8*8) 
DEST[63:0] = temp1[63:0] 

PALIGNR (with 128-bit operands) 
temp1[255:0] ← ((DEST[127:0] << 128) OR SRC[127:0])>>(imm8*8); 
DEST[127:0] ← temp1[127:0] 
DEST[VLMAX-1:128] (Unmodified) 

VPALIGNR (VEX.128 encoded version) 
temp1[255:0] ← ((SRC1[127:0] << 128) OR SRC2[127:0])>>(imm8*8); 
DEST[127:0] ← temp1[127:0] 
DEST[VLMAX-1:128] ← 0 

VPALIGNR (VEX.256 encoded version) 
temp1[255:0] ← ((SRC1[127:0] << 128) OR SRC2[127:0])>>(imm8[7:0]*8); 
DEST[127:0] ← temp1[127:0] 
temp1[255:0] ← ((SRC1[255:128] << 128) OR SRC2[255:128])>>(imm8[7:0]*8); 
DEST[MAX_VL-1:128] ← temp1[127:0] 

VPALIGNR (EVEX encoded versions) 
(KL, VL) = (16, 128), (32, 256), (64, 512) 
FOR l ← 0 TO VL-1 with increments of 128 
    temp1[255:0] ← ((SRC1[l+127:l] << 128) OR SRC2[l+127:l])>>(imm8[7:0]*8); 
    TMP_DEST[l+127:l] ← temp1[127:0] 
ENDFOR; 
FOR j ← 0 TO KL-1 
    i ← j * 8 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+7:i] ← TMP_DEST[i+7:i] 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+7:i] remains unchanged* 
                ELSE *zeroing-masking* ; zeroing-masking 
                    DEST[i+7:i] = 0 
            FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 

Intel C/C++ Compiler Intrinsic Equivalents 

PALIGNR: __m64 _mm_alignr_pi8 (__m64 a, __m64 b, int n) 
(V)PALIGNR: __m128i _mm_alignr_epi8 (__m128i a, __m128i b, int n) 
VPALIGNR: __m256i _mm256_alignr_epi8 (__m256i a, __m256i b, const int n) 
VPALIGNR __m512i _mm512_alignr_epi8 (__m512i a, __m512i b, const int n) 
VPALIGNR __m512i _mm512_mask_alignr_epi8 (__m512i s, __mmask64 m, __m512i a, __m512i b, const int n) 
VPALIGNR __m512i _mm512_maskz_alignr_epi8 (__mmask64 m, __m512i a, __m512i b, const int n) 
VPALIGNR __m256i _mm256_mask_alignr_epi8 (__m256i s, __mmask32 m, __m256i a, __m256i b, const int n) 
VPALIGNR __m256i _mm256_maskz_alignr_epi8 (__mmask32 m, __m256i a, __m256i b, const int n) 
VPALIGNR __m128i _mm_mask_alignr_epi8 (__m128i s, __mmask16 m, __m128i a, __m128i b, const int n) 
VPALIGNR __m128i _mm_maskz_alignr_epi8 (__mmask16 m, __m128i a, __m128i b, const int n) 

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 
EVEX-encoded instruction, see Exceptions Type E4NF.nb. 

PAND-Logical AND 


Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description 
0F DB /r¹ | PAND mm, mm/m64 | RM | V/V | MMX | Bitwise AND mm/m64 and mm. 
66 0F DB /r | PAND xmm1, xmm2/m128 | RM | V/V | SSE2 | Bitwise AND of xmm2/m128 and xmm1. 
VEX.NDS.128.66.0F.WIG DB /r | VPAND xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Bitwise AND of xmm3/m128 and xmm2. 
VEX.NDS.256.66.0F.WIG DB /r | VPAND ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Bitwise AND of ymm2, and ymm3/m256 and store result in ymm1. 
EVEX.NDS.128.66.0F.W0 DB /r | VPANDD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Bitwise AND of packed doubleword integers in xmm2 and xmm3/m128/m32bcst and store result in xmm1 using writemask k1. 
EVEX.NDS.256.66.0F.W0 DB /r | VPANDD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Bitwise AND of packed doubleword integers in ymm2 and ymm3/m256/m32bcst and store result in ymm1 using writemask k1. 
EVEX.NDS.512.66.0F.W0 DB /r | VPANDD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | FV | V/V | AVX512F | Bitwise AND of packed doubleword integers in zmm2 and zmm3/m512/m32bcst and store result in zmm1 using writemask k1. 
EVEX.NDS.128.66.0F.W1 DB /r | VPANDQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Bitwise AND of packed quadword integers in xmm2 and xmm3/m128/m64bcst and store result in xmm1 using writemask k1. 
EVEX.NDS.256.66.0F.W1 DB /r | VPANDQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Bitwise AND of packed quadword integers in ymm2 and ymm3/m256/m64bcst and store result in ymm1 using writemask k1. 
EVEX.NDS.512.66.0F.W1 DB /r | VPANDQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | FV | V/V | AVX512F | Bitwise AND of packed quadword integers in zmm2 and zmm3/m512/m64bcst and store result in zmm1 using writemask k1. 


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers" 
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. 


Instruction Operand Encoding 

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA 
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA 
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA 


Description 

Performs a bitwise logical AND operation on the first source operand and second source operand and stores the 
result in the destination operand. Each bit of the result is set to 1 if the corresponding bits of the first and second 
operands are 1, otherwise it is set to 0. 

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15). 

Legacy SSE instructions: The source operand can be an MMX technology register or a 64-bit memory location. The 
destination operand can be an MMX technology register. 

128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM 
register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the 
upper bits (MAX_VL-1:128) of the corresponding ZMM register destination are unmodified. 

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be 
a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 
32/64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with 
writemask k1 at 32/64-bit granularity. 

VEX.256 encoded versions: The first source operand is a YMM register. The second source operand is a YMM 
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAX_VL-1:256) 
of the corresponding ZMM register destination are zeroed. 

VEX.128 encoded versions: The first source operand is an XMM register. The second source operand is an XMM 
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128) 
of the corresponding ZMM register destination are zeroed. 

Operation 

PAND (64-bit operand) 
DEST ← DEST AND SRC 

PAND (128-bit Legacy SSE version) 
DEST ← DEST AND SRC 
DEST[VLMAX-1:128] (Unmodified) 

VPAND (VEX.128 encoded version) 
DEST ← SRC1 AND SRC2 
DEST[VLMAX-1:128] ← 0 

VPAND (VEX.256 encoded instruction) 
DEST[255:0] ← (SRC1[255:0] AND SRC2[255:0]) 
DEST[VLMAX-1:256] ← 0 

VPANDD (EVEX encoded versions) 
(KL, VL) = (4, 128), (8, 256), (16, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 32 
    IF k1[j] OR *no writemask* 
        THEN 
            IF (EVEX.b = 1) AND (SRC2 *is memory*) 
                THEN DEST[i+31:i] ← SRC1[i+31:i] BITWISE AND SRC2[31:0] 
                ELSE DEST[i+31:i] ← SRC1[i+31:i] BITWISE AND SRC2[i+31:i] 
            FI; 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+31:i] remains unchanged* 
                ELSE ; zeroing-masking 
                    DEST[i+31:i] ← 0 
            FI 
    FI; 
ENDFOR 
DEST[MAX_VL-1:VL] ← 0 
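
The VPANDD loop combines three ideas: per-doubleword masking, an optional broadcast of a single memory element, and the choice between merging and zeroing. The following sketch models them with plain arrays; `vpandd_model` and its boolean flags are hypothetical stand-ins for EVEX.b and EVEX.z, and the upper-bit zeroing of the destination register is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical scalar model of EVEX-encoded VPANDD.
 * kl is the element count (4, 8 or 16); bit j of k1 controls element j.
 * When broadcast is true, src2[0] is used for every element, mirroring
 * the EVEX.b = 1 memory-operand case. */
static void vpandd_model(uint32_t *dest, const uint32_t *src1,
                         const uint32_t *src2, unsigned kl, uint16_t k1,
                         bool broadcast, bool zeroing)
{
    assert(kl == 4 || kl == 8 || kl == 16);

    for (unsigned j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u) {
            uint32_t b = broadcast ? src2[0] : src2[j];
            dest[j] = src1[j] & b;
        } else if (zeroing) {
            dest[j] = 0;             /* zeroing-masking */
        }
        /* merging-masking: dest[j] is left unchanged */
    }
}
```

The quadword form differs only in element width and in using at most 8 mask bits.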

VPANDQ (EVEX encoded versions) 
(KL, VL) = (2, 128), (4, 256), (8, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 64 
    IF k1[j] OR *no writemask* 
        THEN 
            IF (EVEX.b = 1) AND (SRC2 *is memory*) 
                THEN DEST[i+63:i] ← SRC1[i+63:i] BITWISE AND SRC2[63:0] 
                ELSE DEST[i+63:i] ← SRC1[i+63:i] BITWISE AND SRC2[i+63:i] 
            FI; 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+63:i] remains unchanged* 
                ELSE ; zeroing-masking 
                    DEST[i+63:i] ← 0 
            FI 
    FI; 
ENDFOR 
DEST[MAX_VL-1:VL] ← 0 

Intel C/C++ Compiler Intrinsic Equivalents 

VPANDD __m512i _mm512_and_epi32( __m512i a, __m512i b); 
VPANDD __m512i _mm512_mask_and_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b); 
VPANDD __m512i _mm512_maskz_and_epi32( __mmask16 k, __m512i a, __m512i b); 
VPANDQ __m512i _mm512_and_epi64( __m512i a, __m512i b); 
VPANDQ __m512i _mm512_mask_and_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b); 
VPANDQ __m512i _mm512_maskz_and_epi64( __mmask8 k, __m512i a, __m512i b); 
VPANDD __m256i _mm256_mask_and_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b); 
VPANDD __m256i _mm256_maskz_and_epi32( __mmask8 k, __m256i a, __m256i b); 
VPANDD __m128i _mm_mask_and_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b); 
VPANDD __m128i _mm_maskz_and_epi32( __mmask8 k, __m128i a, __m128i b); 
VPANDQ __m256i _mm256_mask_and_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b); 
VPANDQ __m256i _mm256_maskz_and_epi64( __mmask8 k, __m256i a, __m256i b); 
VPANDQ __m128i _mm_mask_and_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b); 
VPANDQ __m128i _mm_maskz_and_epi64( __mmask8 k, __m128i a, __m128i b); 
PAND: __m64 _mm_and_si64 (__m64 m1, __m64 m2) 
(V)PAND: __m128i _mm_and_si128 ( __m128i a, __m128i b) 
VPAND: __m256i _mm256_and_si256 ( __m256i a, __m256i b) 

Flags Affected 

None. 

Numeric Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded instruction, see Exceptions Type E4. 

PANDN-Logical AND NOT 


Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description 
0F DF /r¹ | PANDN mm, mm/m64 | RM | V/V | MMX | Bitwise AND NOT of mm/m64 and mm. 
66 0F DF /r | PANDN xmm1, xmm2/m128 | RM | V/V | SSE2 | Bitwise AND NOT of xmm2/m128 and xmm1. 
VEX.NDS.128.66.0F.WIG DF /r | VPANDN xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Bitwise AND NOT of xmm3/m128 and xmm2. 
VEX.NDS.256.66.0F.WIG DF /r | VPANDN ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Bitwise AND NOT of ymm2, and ymm3/m256 and store result in ymm1. 
EVEX.NDS.128.66.0F.W0 DF /r | VPANDND xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Bitwise AND NOT of packed doubleword integers in xmm2 and xmm3/m128/m32bcst and store result in xmm1 using writemask k1. 
EVEX.NDS.256.66.0F.W0 DF /r | VPANDND ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Bitwise AND NOT of packed doubleword integers in ymm2 and ymm3/m256/m32bcst and store result in ymm1 using writemask k1. 
EVEX.NDS.512.66.0F.W0 DF /r | VPANDND zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | FV | V/V | AVX512F | Bitwise AND NOT of packed doubleword integers in zmm2 and zmm3/m512/m32bcst and store result in zmm1 using writemask k1. 
EVEX.NDS.128.66.0F.W1 DF /r | VPANDNQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Bitwise AND NOT of packed quadword integers in xmm2 and xmm3/m128/m64bcst and store result in xmm1 using writemask k1. 
EVEX.NDS.256.66.0F.W1 DF /r | VPANDNQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Bitwise AND NOT of packed quadword integers in ymm2 and ymm3/m256/m64bcst and store result in ymm1 using writemask k1. 
EVEX.NDS.512.66.0F.W1 DF /r | VPANDNQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | FV | V/V | AVX512F | Bitwise AND NOT of packed quadword integers in zmm2 and zmm3/m512/m64bcst and store result in zmm1 using writemask k1. 


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers" 
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. 


Instruction Operand Encoding 

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA 
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA 
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA 


Description 

Performs a bitwise logical NOT operation on the first source operand, then performs a bitwise AND with the second 
source operand and stores the result in the destination operand. Each bit of the result is set to 1 if the 
corresponding bit in the first operand is 0 and the corresponding bit in the second operand is 1, otherwise it is set to 0. 

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15). 

Legacy SSE instructions: The source operand can be an MMX technology register or a 64-bit memory location. The 
destination operand can be an MMX technology register. 

128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM 
register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the 
upper bits (MAX_VL-1:128) of the corresponding ZMM register destination are unmodified. 

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be 
a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 
32/64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with 
writemask k1 at 32/64-bit granularity. 

VEX.256 encoded versions: The first source operand is a YMM register. The second source operand is a YMM 
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAX_VL-1:256) 
of the corresponding ZMM register destination are zeroed. 

VEX.128 encoded versions: The first source operand is an XMM register. The second source operand is an XMM 
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128) 
of the corresponding ZMM register destination are zeroed. 

Operation 

PANDN (64-bit operand) 
DEST ← NOT(DEST) AND SRC 

PANDN (128-bit Legacy SSE version) 
DEST ← NOT(DEST) AND SRC 
DEST[VLMAX-1:128] (Unmodified) 

VPANDN (VEX.128 encoded version) 
DEST ← NOT(SRC1) AND SRC2 
DEST[VLMAX-1:128] ← 0 

VPANDN (VEX.256 encoded instruction) 
DEST[255:0] ← ((NOT SRC1[255:0]) AND SRC2[255:0]) 
DEST[VLMAX-1:256] ← 0 

VPANDND (EVEX encoded versions) 
(KL, VL) = (4, 128), (8, 256), (16, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 32 
    IF k1[j] OR *no writemask* 
        THEN 
            IF (EVEX.b = 1) AND (SRC2 *is memory*) 
                THEN DEST[i+31:i] ← ((NOT SRC1[i+31:i]) AND SRC2[31:0]) 
                ELSE DEST[i+31:i] ← ((NOT SRC1[i+31:i]) AND SRC2[i+31:i]) 
            FI; 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+31:i] remains unchanged* 
                ELSE ; zeroing-masking 
                    DEST[i+31:i] ← 0 
            FI 
    FI; 
ENDFOR 
DEST[MAX_VL-1:VL] ← 0 

VPANDNQ (EVEX encoded versions) 
(KL, VL) = (2, 128), (4, 256), (8, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 64 
    IF k1[j] OR *no writemask* 
        THEN 
            IF (EVEX.b = 1) AND (SRC2 *is memory*) 
                THEN DEST[i+63:i] ← ((NOT SRC1[i+63:i]) AND SRC2[63:0]) 
                ELSE DEST[i+63:i] ← ((NOT SRC1[i+63:i]) AND SRC2[i+63:i]) 
            FI; 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+63:i] remains unchanged* 
                ELSE ; zeroing-masking 
                    DEST[i+63:i] ← 0 
            FI 
    FI; 
ENDFOR 
DEST[MAX_VL-1:VL] ← 0 

Intel C/C++ Compiler Intrinsic Equivalents 

VPANDND __m512i _mm512_andnot_epi32( __m512i a, __m512i b); 
VPANDND __m512i _mm512_mask_andnot_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b); 
VPANDND __m512i _mm512_maskz_andnot_epi32( __mmask16 k, __m512i a, __m512i b); 
VPANDND __m256i _mm256_mask_andnot_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b); 
VPANDND __m256i _mm256_maskz_andnot_epi32( __mmask8 k, __m256i a, __m256i b); 
VPANDND __m128i _mm_mask_andnot_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b); 
VPANDND __m128i _mm_maskz_andnot_epi32( __mmask8 k, __m128i a, __m128i b); 
VPANDNQ __m512i _mm512_andnot_epi64( __m512i a, __m512i b); 
VPANDNQ __m512i _mm512_mask_andnot_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b); 
VPANDNQ __m512i _mm512_maskz_andnot_epi64( __mmask8 k, __m512i a, __m512i b); 
VPANDNQ __m256i _mm256_mask_andnot_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b); 
VPANDNQ __m256i _mm256_maskz_andnot_epi64( __mmask8 k, __m256i a, __m256i b); 
VPANDNQ __m128i _mm_mask_andnot_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b); 
VPANDNQ __m128i _mm_maskz_andnot_epi64( __mmask8 k, __m128i a, __m128i b); 
PANDN: __m64 _mm_andnot_si64 (__m64 m1, __m64 m2) 
(V)PANDN: __m128i _mm_andnot_si128 ( __m128i a, __m128i b) 
VPANDN: __m256i _mm256_andnot_si256 ( __m256i a, __m256i b) 

Flags Affected 

None. 

Numeric Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded instruction, see Exceptions Type E4. 

PAUSE—Spin Loop Hint 


Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description 
F3 90 | PAUSE | NP | Valid | Valid | Gives hint to processor that improves performance of spin-wait loops. 


Instruction Operand Encoding 

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
NP | NA | NA | NA | NA 


Description 

Improves the performance of spin-wait loops. When executing a "spin-wait loop," a processor will suffer a severe 
performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE 
instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint 
to avoid the memory order violation in most situations, which greatly improves processor performance. For this 
reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops. 

An additional function of the PAUSE instruction is to reduce the power consumed by a processor while executing a 
spin loop. A processor can execute a spin-wait loop extremely quickly, causing the processor to consume a lot of 
power while it waits for the resource it is spinning on to become available. Inserting a PAUSE instruction in a 
spin-wait loop greatly reduces the processor's power consumption. 
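
As an illustration of where the hint goes, the sketch below places PAUSE in the inner wait of a simple test-and-test-and-set lock. The type `spinlock_model`, the functions `spin_lock`/`spin_unlock` and the macro `cpu_relax` are hypothetical names for this example; the only assumed APIs are the C11 `<stdatomic.h>` operations and the `_mm_pause` compiler intrinsic, with a no-op fallback on non-x86 targets.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define cpu_relax() _mm_pause()      /* compiles to the PAUSE instruction */
#else
#define cpu_relax() ((void)0)        /* assumption: no hint on other targets */
#endif

typedef struct {
    atomic_bool locked;
} spinlock_model;

static void spin_lock(spinlock_model *l)
{
    assert(l != NULL);
    for (;;) {
        /* Try to take the lock; exchange returns the previous value. */
        if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
            return;
        /* Spin on plain loads, hinting the processor on every iteration. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            cpu_relax();
    }
}

static void spin_unlock(spinlock_model *l)
{
    assert(l != NULL);
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

Spinning on a read-only load and issuing the hint inside that inner loop keeps the waiting core from hammering the cache line with exchanges while the lock is held.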

This instruction was introduced in the Pentium 4 processors, but is backward compatible with all IA-32 processors. 
In earlier IA-32 processors, the PAUSE instruction operates like a NOP instruction. The Pentium 4 and Intel Xeon 
processors implement the PAUSE instruction as a delay. The delay is finite and can be zero for some processors. 
This instruction does not change the architectural state of the processor (that is, it performs essentially a delaying 
no-op operation). 

This instruction's operation is the same in non-64-bit modes and 64-bit mode. 

Operation 

Execute_Next_Instruction(DELAY); 

Numeric Exceptions 

None. 

Exceptions (All Operating Modes) 

#UD If the LOCK prefix is used. 

PAVGB/PAVGW—Average Packed Integers 

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description 
0F E0 /r¹ | PAVGB mm1, mm2/m64 | RM | V/V | SSE | Average packed unsigned byte integers from mm2/m64 and mm1 with rounding. 
66 0F E0 /r | PAVGB xmm1, xmm2/m128 | RM | V/V | SSE2 | Average packed unsigned byte integers from xmm2/m128 and xmm1 with rounding. 
0F E3 /r¹ | PAVGW mm1, mm2/m64 | RM | V/V | SSE | Average packed unsigned word integers from mm2/m64 and mm1 with rounding. 
66 0F E3 /r | PAVGW xmm1, xmm2/m128 | RM | V/V | SSE2 | Average packed unsigned word integers from xmm2/m128 and xmm1 with rounding. 
VEX.NDS.128.66.0F.WIG E0 /r | VPAVGB xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Average packed unsigned byte integers from xmm3/m128 and xmm2 with rounding. 
VEX.NDS.128.66.0F.WIG E3 /r | VPAVGW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Average packed unsigned word integers from xmm3/m128 and xmm2 with rounding. 
VEX.NDS.256.66.0F.WIG E0 /r | VPAVGB ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Average packed unsigned byte integers from ymm2, and ymm3/m256 with rounding and store to ymm1. 
VEX.NDS.256.66.0F.WIG E3 /r | VPAVGW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Average packed unsigned word integers from ymm2, ymm3/m256 with rounding to ymm1. 
EVEX.NDS.128.66.0F.WIG E0 /r | VPAVGB xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Average packed unsigned byte integers from xmm2, and xmm3/m128 with rounding and store to xmm1 under writemask k1. 
EVEX.NDS.256.66.0F.WIG E0 /r | VPAVGB ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Average packed unsigned byte integers from ymm2, and ymm3/m256 with rounding and store to ymm1 under writemask k1. 
EVEX.NDS.512.66.0F.WIG E0 /r | VPAVGB zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Average packed unsigned byte integers from zmm2, and zmm3/m512 with rounding and store to zmm1 under writemask k1. 
EVEX.NDS.128.66.0F.WIG E3 /r | VPAVGW xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Average packed unsigned word integers from xmm2, xmm3/m128 with rounding to xmm1 under writemask k1. 
EVEX.NDS.256.66.0F.WIG E3 /r | VPAVGW ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Average packed unsigned word integers from ymm2, ymm3/m256 with rounding to ymm1 under writemask k1. 
EVEX.NDS.512.66.0F.WIG E3 /r | VPAVGW zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Average packed unsigned word integers from zmm2, zmm3/m512 with rounding to zmm1 under writemask k1. 


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers" 
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. 


Instruction Operand Encoding 

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA 
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA 
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA 


Description 

Performs a SIMD average of the packed unsigned integers from the source operand (second operand) and the 
destination operand (first operand), and stores the results in the destination operand. For each corresponding pair 
of data elements in the first and second operands, the elements are added together, a 1 is added to the temporary 
sum, and that result is shifted right one bit position. 

The (V)PAVGB instruction operates on packed unsigned bytes and the (V)PAVGW instruction operates on packed 
unsigned words. 
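
The per-element rule (add, add 1, shift right by one) can be checked with a short scalar model. The names `avg_round_u8` and `pavgb_model` are hypothetical; the wider intermediate type stands in for the 9-bit temporary sum, which is why the result never overflows even when both inputs are 255.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rounded average of two unsigned bytes: (a + b + 1) >> 1, computed in a
 * type wide enough to hold the 9-bit temporary sum. */
static uint8_t avg_round_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Hypothetical scalar model of PAVGB over n packed bytes; dest is both
 * the first source and the destination. */
static void pavgb_model(uint8_t *dest, const uint8_t *src, size_t n)
{
    assert(dest != NULL && src != NULL);

    for (size_t i = 0; i < n; i++)
        dest[i] = avg_round_u8(dest[i], src[i]);
}
```

The added 1 makes ties round up, so the average of 0 and 1 is 1 instead of 0. The word form follows the same pattern with 16-bit elements and a 17-bit temporary sum.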

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15). 

Legacy SSE instructions: The source operand can be an MMX technology register or a 64-bit memory location. The 
destination operand can be an MMX technology register. 

128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM 
register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the 
upper bits (MAX_VL-1:128) of the corresponding register destination are unmodified. 

EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM 
register or a 512-bit memory location. The destination operand is a ZMM register. 

VEX.256 and EVEX.256 encoded versions: The first source operand is a YMM register. The second source operand 
is a YMM register or a 256-bit memory location. The destination operand is a YMM register. 

VEX.128 and EVEX.128 encoded versions: The first source operand is an XMM register. The second source operand 
is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits 
(MAX_VL-1:128) of the corresponding register destination are zeroed. 

Operation 

PAVGB (with 64-bit operands) 
DEST[7:0] ← (SRC[7:0] + DEST[7:0] + 1) >> 1; (* Temp sum before shifting is 9 bits *) 
(* Repeat operation performed for bytes 2 through 6 *) 
DEST[63:56] ← (SRC[63:56] + DEST[63:56] + 1) >> 1; 

PAVGW (with 64-bit operands) 
DEST[15:0] ← (SRC[15:0] + DEST[15:0] + 1) >> 1; (* Temp sum before shifting is 17 bits *) 
(* Repeat operation performed for words 2 and 3 *) 
DEST[63:48] ← (SRC[63:48] + DEST[63:48] + 1) >> 1; 

PAVGB (with 128-bit operands) 
DEST[7:0] ← (SRC[7:0] + DEST[7:0] + 1) >> 1; (* Temp sum before shifting is 9 bits *) 
(* Repeat operation performed for bytes 2 through 14 *) 
DEST[127:120] ← (SRC[127:120] + DEST[127:120] + 1) >> 1; 

PAVGW (with 128-bit operands) 
DEST[15:0] ← (SRC[15:0] + DEST[15:0] + 1) >> 1; (* Temp sum before shifting is 17 bits *) 
(* Repeat operation performed for words 2 through 6 *) 
DEST[127:112] ← (SRC[127:112] + DEST[127:112] + 1) >> 1; 

VPAVGB (VEX.128 encoded version) 
DEST[7:0] ← (SRC1[7:0] + SRC2[7:0] + 1) >> 1; 
(* Repeat operation performed for bytes 2 through 15 *) 
DEST[127:120] ← (SRC1[127:120] + SRC2[127:120] + 1) >> 1 
DEST[VLMAX-1:128] ← 0 

VPAVGW (VEX.128 encoded version) 
DEST[15:0] ← (SRC1[15:0] + SRC2[15:0] + 1) >> 1; 
(* Repeat operation performed for 16-bit words 2 through 7 *) 
DEST[127:112] ← (SRC1[127:112] + SRC2[127:112] + 1) >> 1 
DEST[VLMAX-1:128] ← 0 

VPAVGB (VEX.256 encoded instruction) 
DEST[7:0] ← (SRC1[7:0] + SRC2[7:0] + 1) >> 1; (* Temp sum before shifting is 9 bits *) 
(* Repeat operation performed for bytes 2 through 31 *) 
DEST[255:248] ← (SRC1[255:248] + SRC2[255:248] + 1) >> 1; 

VPAVGW (VEX.256 encoded instruction) 
DEST[15:0] ← (SRC1[15:0] + SRC2[15:0] + 1) >> 1; (* Temp sum before shifting is 17 bits *) 
(* Repeat operation performed for words 2 through 15 *) 
DEST[255:240] ← (SRC1[255:240] + SRC2[255:240] + 1) >> 1; 

VPAVGB (EVEX encoded versions) 
(KL, VL) = (16, 128), (32, 256), (64, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 8 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+7:i] ← (SRC1[i+7:i] + SRC2[i+7:i] + 1) >> 1; (* Temp sum before shifting is 9 bits *) 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+7:i] remains unchanged* 
                ELSE *zeroing-masking* ; zeroing-masking 
                    DEST[i+7:i] = 0 
            FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 

VPAVGW (EVEX encoded versions) 
(KL, VL) = (8, 128), (16, 256), (32, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 16 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+15:i] ← (SRC1[i+15:i] + SRC2[i+15:i] + 1) >> 1 
            ; (* Temp sum before shifting is 17 bits *) 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+15:i] remains unchanged* 
                ELSE *zeroing-masking* ; zeroing-masking 
                    DEST[i+15:i] = 0 
            FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 

Intel C/C++ Compiler Intrinsic Equivalents 

VPAVGB __m512i _mm512_avg_epu8( __m512i a, __m512i b); 
VPAVGW __m512i _mm512_avg_epu16( __m512i a, __m512i b); 
VPAVGB __m512i _mm512_mask_avg_epu8(__m512i s, __mmask64 m, __m512i a, __m512i b); 
VPAVGW __m512i _mm512_mask_avg_epu16(__m512i s, __mmask32 m, __m512i a, __m512i b); 
VPAVGB __m512i _mm512_maskz_avg_epu8( __mmask64 m, __m512i a, __m512i b); 
VPAVGW __m512i _mm512_maskz_avg_epu16( __mmask32 m, __m512i a, __m512i b); 
VPAVGB __m256i _mm256_mask_avg_epu8(__m256i s, __mmask32 m, __m256i a, __m256i b); 
VPAVGW __m256i _mm256_mask_avg_epu16(__m256i s, __mmask16 m, __m256i a, __m256i b); 
VPAVGB __m256i _mm256_maskz_avg_epu8( __mmask32 m, __m256i a, __m256i b); 
VPAVGW __m256i _mm256_maskz_avg_epu16( __mmask16 m, __m256i a, __m256i b); 
VPAVGB __m128i _mm_mask_avg_epu8(__m128i s, __mmask16 m, __m128i a, __m128i b); 
VPAVGW __m128i _mm_mask_avg_epu16(__m128i s, __mmask8 m, __m128i a, __m128i b); 
VPAVGB __m128i _mm_maskz_avg_epu8( __mmask16 m, __m128i a, __m128i b); 
VPAVGW __m128i _mm_maskz_avg_epu16( __mmask8 m, __m128i a, __m128i b); 
PAVGB: __m64 _mm_avg_pu8 (__m64 a, __m64 b) 
PAVGW: __m64 _mm_avg_pu16 (__m64 a, __m64 b) 
(V)PAVGB: __m128i _mm_avg_epu8 ( __m128i a, __m128i b) 
(V)PAVGW: __m128i _mm_avg_epu16 ( __m128i a, __m128i b) 
VPAVGB: __m256i _mm256_avg_epu8 ( __m256i a, __m256i b) 
VPAVGW: __m256i _mm256_avg_epu16 ( __m256i a, __m256i b) 

Flags Affected 

None. 

Numeric Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded instruction, see Exceptions Type E4.nb. 

PBLENDVB — Variable Blend Packed Bytes 


Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description 
66 0F 38 10 /r | PBLENDVB xmm1, xmm2/m128, <XMM0> | RM | V/V | SSE4_1 | Select byte values from xmm1 and xmm2/m128 from mask specified in the high bit of each byte in XMM0 and store the values into xmm1. 
VEX.NDS.128.66.0F3A.W0 4C /r /is4 | VPBLENDVB xmm1, xmm2, xmm3/m128, xmm4 | RVMR | V/V | AVX | Select byte values from xmm2 and xmm3/m128 using mask bits in the specified mask register, xmm4, and store the values into xmm1. 
VEX.NDS.256.66.0F3A.W0 4C /r /is4 | VPBLENDVB ymm1, ymm2, ymm3/m256, ymm4 | RVMR | V/V | AVX2 | Select byte values from ymm2 and ymm3/m256 from mask specified in the high bit of each byte in ymm4 and store the values into ymm1. 


Instruction Operand Encoding 


| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| RM | ModRM:reg (r, w) | ModRM:r/m (r) | <XMM0> | NA |
| RVMR | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | imm8[7:4] |


Description 

Conditionally copies byte elements from the source operand (second operand) to the destination operand (first
operand) depending on mask bits defined in the implicit third register argument, XMM0. The mask bits are the most
significant bit in each byte element of the XMM0 register.

If a mask bit is "1", then the corresponding byte element in the source operand is copied to the destination;
otherwise the byte element in the destination operand is left unchanged.

The register assignment of the implicit third operand is defined to be the architectural register XMM0.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (VLMAX-1:128)
of the corresponding YMM destination register remain unchanged. The mask register operand is implicitly defined
to be the architectural register XMM0. An attempt to execute PBLENDVB with a VEX prefix will cause #UD.

VEX.128 encoded version: The first source operand and the destination operand are XMM registers. The second
source operand is an XMM register or 128-bit memory location. The mask operand is the third source register, and is
encoded in bits[7:4] of the immediate byte (imm8). Bits[3:0] of imm8 are ignored. In 32-bit mode, imm8[7] is
ignored. The upper bits (VLMAX-1:128) of the corresponding YMM register (destination register) are zeroed. VEX.L
must be 0, otherwise the instruction will #UD. VEX.W must be 0, otherwise the instruction will #UD.

VEX.256 encoded version: The first source operand and the destination operand are YMM registers. The second
source operand is a YMM register or 256-bit memory location. The third source register is a YMM register and is
encoded in bits[7:4] of the immediate byte (imm8). Bits[3:0] of imm8 are ignored. In 32-bit mode, imm8[7] is
ignored.

VPBLENDVB permits the mask to be any XMM or YMM register. In contrast, PBLENDVB treats XMM0 implicitly as the
mask and does not support non-destructive destination operation. An attempt to execute PBLENDVB encoded with a
VEX prefix will cause a #UD exception.
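
The byte selection described above can be sketched as a small Python model. It is illustrative only: `blendv_bytes` is a hypothetical helper name, and registers are represented as plain lists of byte values. For the legacy PBLENDVB form, `src1` plays the role of the destination register and `mask` the role of XMM0.

```python
def blendv_bytes(src1, src2, mask):
    """Pick src2[i] where bit 7 of mask[i] is set, otherwise src1[i].

    All three arguments are equal-length lists of byte values (0..255);
    the function name and list representation are assumptions for this sketch.
    """
    return [b if (m & 0x80) else a for a, b, m in zip(src1, src2, mask)]

# Only the most significant bit of each mask byte matters: 0x7F selects src1,
# while 0x80 and 0xFF both select src2.
blended = blendv_bytes([0x11] * 4, [0x22] * 4, [0x80, 0x7F, 0xFF, 0x00])
```
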

Operation 

PBLENDVB (128-bit Legacy SSE version)

MASK ← XMM0
IF (MASK[7] = 1) THEN DEST[7:0] ← SRC[7:0];
ELSE DEST[7:0] ← DEST[7:0];
IF (MASK[15] = 1) THEN DEST[15:8] ← SRC[15:8];
ELSE DEST[15:8] ← DEST[15:8];
IF (MASK[23] = 1) THEN DEST[23:16] ← SRC[23:16];
ELSE DEST[23:16] ← DEST[23:16];
IF (MASK[31] = 1) THEN DEST[31:24] ← SRC[31:24];
ELSE DEST[31:24] ← DEST[31:24];
IF (MASK[39] = 1) THEN DEST[39:32] ← SRC[39:32];
ELSE DEST[39:32] ← DEST[39:32];
IF (MASK[47] = 1) THEN DEST[47:40] ← SRC[47:40];
ELSE DEST[47:40] ← DEST[47:40];
IF (MASK[55] = 1) THEN DEST[55:48] ← SRC[55:48];
ELSE DEST[55:48] ← DEST[55:48];
IF (MASK[63] = 1) THEN DEST[63:56] ← SRC[63:56];
ELSE DEST[63:56] ← DEST[63:56];
IF (MASK[71] = 1) THEN DEST[71:64] ← SRC[71:64];
ELSE DEST[71:64] ← DEST[71:64];
IF (MASK[79] = 1) THEN DEST[79:72] ← SRC[79:72];
ELSE DEST[79:72] ← DEST[79:72];
IF (MASK[87] = 1) THEN DEST[87:80] ← SRC[87:80];
ELSE DEST[87:80] ← DEST[87:80];
IF (MASK[95] = 1) THEN DEST[95:88] ← SRC[95:88];
ELSE DEST[95:88] ← DEST[95:88];
IF (MASK[103] = 1) THEN DEST[103:96] ← SRC[103:96];
ELSE DEST[103:96] ← DEST[103:96];
IF (MASK[111] = 1) THEN DEST[111:104] ← SRC[111:104];
ELSE DEST[111:104] ← DEST[111:104];
IF (MASK[119] = 1) THEN DEST[119:112] ← SRC[119:112];
ELSE DEST[119:112] ← DEST[119:112];
IF (MASK[127] = 1) THEN DEST[127:120] ← SRC[127:120];
ELSE DEST[127:120] ← DEST[127:120];
DEST[VLMAX-1:128] (Unmodified)

VPBLENDVB (VEX.128 encoded version)

MASK ← SRC3
IF (MASK[7] = 1) THEN DEST[7:0] ← SRC2[7:0];
ELSE DEST[7:0] ← SRC1[7:0];
IF (MASK[15] = 1) THEN DEST[15:8] ← SRC2[15:8];
ELSE DEST[15:8] ← SRC1[15:8];
IF (MASK[23] = 1) THEN DEST[23:16] ← SRC2[23:16];
ELSE DEST[23:16] ← SRC1[23:16];
IF (MASK[31] = 1) THEN DEST[31:24] ← SRC2[31:24];
ELSE DEST[31:24] ← SRC1[31:24];
IF (MASK[39] = 1) THEN DEST[39:32] ← SRC2[39:32];
ELSE DEST[39:32] ← SRC1[39:32];
IF (MASK[47] = 1) THEN DEST[47:40] ← SRC2[47:40];
ELSE DEST[47:40] ← SRC1[47:40];
IF (MASK[55] = 1) THEN DEST[55:48] ← SRC2[55:48];
ELSE DEST[55:48] ← SRC1[55:48];
IF (MASK[63] = 1) THEN DEST[63:56] ← SRC2[63:56];
ELSE DEST[63:56] ← SRC1[63:56];
IF (MASK[71] = 1) THEN DEST[71:64] ← SRC2[71:64];
ELSE DEST[71:64] ← SRC1[71:64];
IF (MASK[79] = 1) THEN DEST[79:72] ← SRC2[79:72];
ELSE DEST[79:72] ← SRC1[79:72];
IF (MASK[87] = 1) THEN DEST[87:80] ← SRC2[87:80];
ELSE DEST[87:80] ← SRC1[87:80];
IF (MASK[95] = 1) THEN DEST[95:88] ← SRC2[95:88];
ELSE DEST[95:88] ← SRC1[95:88];
IF (MASK[103] = 1) THEN DEST[103:96] ← SRC2[103:96];
ELSE DEST[103:96] ← SRC1[103:96];
IF (MASK[111] = 1) THEN DEST[111:104] ← SRC2[111:104];
ELSE DEST[111:104] ← SRC1[111:104];
IF (MASK[119] = 1) THEN DEST[119:112] ← SRC2[119:112];
ELSE DEST[119:112] ← SRC1[119:112];
IF (MASK[127] = 1) THEN DEST[127:120] ← SRC2[127:120];
ELSE DEST[127:120] ← SRC1[127:120];
DEST[VLMAX-1:128] ← 0

VPBLENDVB (VEX.256 encoded version)

MASK ← SRC3
IF (MASK[7] == 1) THEN DEST[7:0] ← SRC2[7:0];
ELSE DEST[7:0] ← SRC1[7:0];
IF (MASK[15] == 1) THEN DEST[15:8] ← SRC2[15:8];
ELSE DEST[15:8] ← SRC1[15:8];
IF (MASK[23] == 1) THEN DEST[23:16] ← SRC2[23:16];
ELSE DEST[23:16] ← SRC1[23:16];
IF (MASK[31] == 1) THEN DEST[31:24] ← SRC2[31:24];
ELSE DEST[31:24] ← SRC1[31:24];
IF (MASK[39] == 1) THEN DEST[39:32] ← SRC2[39:32];
ELSE DEST[39:32] ← SRC1[39:32];
IF (MASK[47] == 1) THEN DEST[47:40] ← SRC2[47:40];
ELSE DEST[47:40] ← SRC1[47:40];
IF (MASK[55] == 1) THEN DEST[55:48] ← SRC2[55:48];
ELSE DEST[55:48] ← SRC1[55:48];
IF (MASK[63] == 1) THEN DEST[63:56] ← SRC2[63:56];
ELSE DEST[63:56] ← SRC1[63:56];
IF (MASK[71] == 1) THEN DEST[71:64] ← SRC2[71:64];
ELSE DEST[71:64] ← SRC1[71:64];
IF (MASK[79] == 1) THEN DEST[79:72] ← SRC2[79:72];
ELSE DEST[79:72] ← SRC1[79:72];
IF (MASK[87] == 1) THEN DEST[87:80] ← SRC2[87:80];
ELSE DEST[87:80] ← SRC1[87:80];
IF (MASK[95] == 1) THEN DEST[95:88] ← SRC2[95:88];
ELSE DEST[95:88] ← SRC1[95:88];
IF (MASK[103] == 1) THEN DEST[103:96] ← SRC2[103:96];
ELSE DEST[103:96] ← SRC1[103:96];
IF (MASK[111] == 1) THEN DEST[111:104] ← SRC2[111:104];
ELSE DEST[111:104] ← SRC1[111:104];
IF (MASK[119] == 1) THEN DEST[119:112] ← SRC2[119:112];
ELSE DEST[119:112] ← SRC1[119:112];
IF (MASK[127] == 1) THEN DEST[127:120] ← SRC2[127:120];
ELSE DEST[127:120] ← SRC1[127:120];
IF (MASK[135] == 1) THEN DEST[135:128] ← SRC2[135:128];
ELSE DEST[135:128] ← SRC1[135:128];
IF (MASK[143] == 1) THEN DEST[143:136] ← SRC2[143:136];
ELSE DEST[143:136] ← SRC1[143:136];
IF (MASK[151] == 1) THEN DEST[151:144] ← SRC2[151:144];
ELSE DEST[151:144] ← SRC1[151:144];
IF (MASK[159] == 1) THEN DEST[159:152] ← SRC2[159:152];
ELSE DEST[159:152] ← SRC1[159:152];
IF (MASK[167] == 1) THEN DEST[167:160] ← SRC2[167:160];
ELSE DEST[167:160] ← SRC1[167:160];
IF (MASK[175] == 1) THEN DEST[175:168] ← SRC2[175:168];
ELSE DEST[175:168] ← SRC1[175:168];
IF (MASK[183] == 1) THEN DEST[183:176] ← SRC2[183:176];
ELSE DEST[183:176] ← SRC1[183:176];
IF (MASK[191] == 1) THEN DEST[191:184] ← SRC2[191:184];
ELSE DEST[191:184] ← SRC1[191:184];
IF (MASK[199] == 1) THEN DEST[199:192] ← SRC2[199:192];
ELSE DEST[199:192] ← SRC1[199:192];
IF (MASK[207] == 1) THEN DEST[207:200] ← SRC2[207:200];
ELSE DEST[207:200] ← SRC1[207:200];
IF (MASK[215] == 1) THEN DEST[215:208] ← SRC2[215:208];
ELSE DEST[215:208] ← SRC1[215:208];
IF (MASK[223] == 1) THEN DEST[223:216] ← SRC2[223:216];
ELSE DEST[223:216] ← SRC1[223:216];
IF (MASK[231] == 1) THEN DEST[231:224] ← SRC2[231:224];
ELSE DEST[231:224] ← SRC1[231:224];
IF (MASK[239] == 1) THEN DEST[239:232] ← SRC2[239:232];
ELSE DEST[239:232] ← SRC1[239:232];
IF (MASK[247] == 1) THEN DEST[247:240] ← SRC2[247:240];
ELSE DEST[247:240] ← SRC1[247:240];
IF (MASK[255] == 1) THEN DEST[255:248] ← SRC2[255:248];
ELSE DEST[255:248] ← SRC1[255:248];


Intel C/C++ Compiler Intrinsic Equivalent 

(V)PBLENDVB: __m128i _mm_blendv_epi8 (__m128i v1, __m128i v2, __m128i mask);

VPBLENDVB: __m256i _mm256_blendv_epi8 (__m256i v1, __m256i v2, __m256i mask);


Flags Affected 

None. 


SIMD Floating-Point Exceptions 

None. 


Other Exceptions 

See Exceptions Type 4; additionally
#UD If VEX.W = 1.

PBLENDW - Blend Packed Words

| Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description |
|---|---|---|---|---|---|
| 66 0F 3A 0E /r ib | PBLENDW xmm1, xmm2/m128, imm8 | RMI | V/V | SSE4_1 | Select words from xmm1 and xmm2/m128 from mask specified in imm8 and store the values into xmm1. |
| VEX.NDS.128.66.0F3A.WIG 0E /r ib | VPBLENDW xmm1, xmm2, xmm3/m128, imm8 | RVMI | V/V | AVX | Select words from xmm2 and xmm3/m128 from mask specified in imm8 and store the values into xmm1. |
| VEX.NDS.256.66.0F3A.WIG 0E /r ib | VPBLENDW ymm1, ymm2, ymm3/m256, imm8 | RVMI | V/V | AVX2 | Select words from ymm2 and ymm3/m256 from mask specified in imm8 and store the values into ymm1. |


Instruction Operand Encoding

| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| RMI | ModRM:reg (r, w) | ModRM:r/m (r) | imm8 | NA |
| RVMI | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | imm8 |


Description 

Words from the source operand (second operand) are conditionally written to the destination operand (first 
operand) depending on bits in the immediate operand (third operand). The immediate bits (bits 7:0) form a mask 
that determines whether the corresponding word in the destination is copied from the source. If a bit in the mask, 
corresponding to a word, is "1", then the word is copied, else the word element in the destination operand is 
unchanged. 

128-bit Legacy SSE version: The second source operand can be an XMM register or a 128-bit memory location. The
first source and destination operands are XMM registers. Bits (VLMAX-1:128) of the corresponding YMM destination
register remain unchanged.

VEX.128 encoded version: The second source operand can be an XMM register or a 128-bit memory location. The
first source and destination operands are XMM registers. Bits (VLMAX-1:128) of the corresponding YMM register
are zeroed.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register.

Operation 

PBLENDW (128-bit Legacy SSE version)

IF (Imm8[0] = 1) THEN DEST[15:0] ← SRC[15:0]
ELSE DEST[15:0] ← DEST[15:0]
IF (Imm8[1] = 1) THEN DEST[31:16] ← SRC[31:16]
ELSE DEST[31:16] ← DEST[31:16]
IF (Imm8[2] = 1) THEN DEST[47:32] ← SRC[47:32]
ELSE DEST[47:32] ← DEST[47:32]
IF (Imm8[3] = 1) THEN DEST[63:48] ← SRC[63:48]
ELSE DEST[63:48] ← DEST[63:48]
IF (Imm8[4] = 1) THEN DEST[79:64] ← SRC[79:64]
ELSE DEST[79:64] ← DEST[79:64]
IF (Imm8[5] = 1) THEN DEST[95:80] ← SRC[95:80]
ELSE DEST[95:80] ← DEST[95:80]
IF (Imm8[6] = 1) THEN DEST[111:96] ← SRC[111:96]
ELSE DEST[111:96] ← DEST[111:96]
IF (Imm8[7] = 1) THEN DEST[127:112] ← SRC[127:112]
ELSE DEST[127:112] ← DEST[127:112]


VPBLENDW (VEX.128 encoded version)

IF (imm8[0] = 1) THEN DEST[15:0] ← SRC2[15:0]
ELSE DEST[15:0] ← SRC1[15:0]
IF (imm8[1] = 1) THEN DEST[31:16] ← SRC2[31:16]
ELSE DEST[31:16] ← SRC1[31:16]
IF (imm8[2] = 1) THEN DEST[47:32] ← SRC2[47:32]
ELSE DEST[47:32] ← SRC1[47:32]
IF (imm8[3] = 1) THEN DEST[63:48] ← SRC2[63:48]
ELSE DEST[63:48] ← SRC1[63:48]
IF (imm8[4] = 1) THEN DEST[79:64] ← SRC2[79:64]
ELSE DEST[79:64] ← SRC1[79:64]
IF (imm8[5] = 1) THEN DEST[95:80] ← SRC2[95:80]
ELSE DEST[95:80] ← SRC1[95:80]
IF (imm8[6] = 1) THEN DEST[111:96] ← SRC2[111:96]
ELSE DEST[111:96] ← SRC1[111:96]
IF (imm8[7] = 1) THEN DEST[127:112] ← SRC2[127:112]
ELSE DEST[127:112] ← SRC1[127:112]
DEST[VLMAX-1:128] ← 0

VPBLENDW (VEX.256 encoded version)

IF (imm8[0] == 1) THEN DEST[15:0] ← SRC2[15:0]
ELSE DEST[15:0] ← SRC1[15:0]
IF (imm8[1] == 1) THEN DEST[31:16] ← SRC2[31:16]
ELSE DEST[31:16] ← SRC1[31:16]
IF (imm8[2] == 1) THEN DEST[47:32] ← SRC2[47:32]
ELSE DEST[47:32] ← SRC1[47:32]
IF (imm8[3] == 1) THEN DEST[63:48] ← SRC2[63:48]
ELSE DEST[63:48] ← SRC1[63:48]
IF (imm8[4] == 1) THEN DEST[79:64] ← SRC2[79:64]
ELSE DEST[79:64] ← SRC1[79:64]
IF (imm8[5] == 1) THEN DEST[95:80] ← SRC2[95:80]
ELSE DEST[95:80] ← SRC1[95:80]
IF (imm8[6] == 1) THEN DEST[111:96] ← SRC2[111:96]
ELSE DEST[111:96] ← SRC1[111:96]
IF (imm8[7] == 1) THEN DEST[127:112] ← SRC2[127:112]
ELSE DEST[127:112] ← SRC1[127:112]
IF (imm8[0] == 1) THEN DEST[143:128] ← SRC2[143:128]
ELSE DEST[143:128] ← SRC1[143:128]
IF (imm8[1] == 1) THEN DEST[159:144] ← SRC2[159:144]
ELSE DEST[159:144] ← SRC1[159:144]
IF (imm8[2] == 1) THEN DEST[175:160] ← SRC2[175:160]
ELSE DEST[175:160] ← SRC1[175:160]
IF (imm8[3] == 1) THEN DEST[191:176] ← SRC2[191:176]
ELSE DEST[191:176] ← SRC1[191:176]
IF (imm8[4] == 1) THEN DEST[207:192] ← SRC2[207:192]
ELSE DEST[207:192] ← SRC1[207:192]
IF (imm8[5] == 1) THEN DEST[223:208] ← SRC2[223:208]
ELSE DEST[223:208] ← SRC1[223:208]
IF (imm8[6] == 1) THEN DEST[239:224] ← SRC2[239:224]
ELSE DEST[239:224] ← SRC1[239:224]
IF (imm8[7] == 1) THEN DEST[255:240] ← SRC2[255:240]
ELSE DEST[255:240] ← SRC1[255:240]
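
The word blend can be condensed into a short Python model. This is a sketch under stated assumptions: `blend_words` is a hypothetical name, and the operands are lists of 8 words (128-bit forms) or 16 words (VEX.256 form). It makes one property of the listing easy to see: in the 256-bit form the same eight imm8 bits steer both 128-bit lanes.

```python
def blend_words(src1, src2, imm8):
    """Blend 16-bit words: take src2[i] when the controlling imm8 bit is 1.

    src1 and src2 are lists of 8 or 16 word values; the name and list
    representation are assumptions made for this sketch.
    """
    # Bit (i mod 8) of imm8 controls word i, so word 8 reuses imm8[0],
    # word 9 reuses imm8[1], and so on.
    return [b if (imm8 >> (i % 8)) & 1 else a
            for i, (a, b) in enumerate(zip(src1, src2))]

low = list(range(16))
high = [100 + i for i in range(16)]
blended_256 = blend_words(low, high, 0b00000101)
```

With imm8 = 0000_0101B, words 0 and 2 of each lane (words 0, 2, 8 and 10 overall) come from the second source.
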

Intel C/C++ Compiler Intrinsic Equivalent

(V)PBLENDW: __m128i _mm_blend_epi16 (__m128i v1, __m128i v2, const int mask);

VPBLENDW: __m256i _mm256_blend_epi16 (__m256i v1, __m256i v2, const int mask)

Flags Affected 

None. 

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

See Exceptions Type 4; additionally 

#UD If VEX.L = 1 and AVX2 = 0. 

PCLMULQDQ - Carry-Less Multiplication Quadword

| Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description |
|---|---|---|---|---|---|
| 66 0F 3A 44 /r ib | PCLMULQDQ xmm1, xmm2/m128, imm8 | RMI | V/V | PCLMULQDQ | Carry-less multiplication of one quadword of xmm1 by one quadword of xmm2/m128, stores the 128-bit result in xmm1. The immediate is used to determine which quadwords of xmm1 and xmm2/m128 should be used. |
| VEX.NDS.128.66.0F3A.WIG 44 /r ib | VPCLMULQDQ xmm1, xmm2, xmm3/m128, imm8 | RVMI | V/V | Both PCLMULQDQ and AVX flags | Carry-less multiplication of one quadword of xmm2 by one quadword of xmm3/m128, stores the 128-bit result in xmm1. The immediate is used to determine which quadwords of xmm2 and xmm3/m128 should be used. |


Instruction Operand Encoding

| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| RMI | ModRM:reg (r, w) | ModRM:r/m (r) | imm8 | NA |
| RVMI | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | imm8 |


Description 

Performs a carry-less multiplication of two quadwords, selected from the first source and second source operand
according to the value of the immediate byte. Bits 4 and 0 are used to select which 64-bit half of each operand to
use according to Table 4-13; other bits of the immediate byte are ignored.


Table 4-13. PCLMULQDQ Quadword Selection of Immediate Byte

| Imm[4] | Imm[0] | PCLMULQDQ Operation |
|---|---|---|
| 0 | 0 | CL_MUL( SRC2¹[63:0], SRC1[63:0] ) |
| 0 | 1 | CL_MUL( SRC2[63:0], SRC1[127:64] ) |
| 1 | 0 | CL_MUL( SRC2[127:64], SRC1[63:0] ) |
| 1 | 1 | CL_MUL( SRC2[127:64], SRC1[127:64] ) |

NOTES:

1. SRC2 denotes the second source operand, which can be a register or memory; SRC1 denotes the first source and
destination operand.


The first source operand and the destination operand are the same and must be an XMM register. The second
source operand can be an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the corresponding
YMM destination register remain unchanged.

Compilers and assemblers may implement the following pseudo-op syntax to simplify programming and emit the
required encoding for Imm8.


Table 4-14. Pseudo-Op and PCLMULQDQ Implementation

| Pseudo-Op | Imm8 Encoding |
|---|---|
| PCLMULLQLQDQ xmm1, xmm2 | 0000_0000B |
| PCLMULHQLQDQ xmm1, xmm2 | 0000_0001B |
| PCLMULLQHQDQ xmm1, xmm2 | 0001_0000B |
| PCLMULHQHQDQ xmm1, xmm2 | 0001_0001B |

Operation

PCLMULQDQ

IF (Imm8[0] = 0)
    THEN
        TEMP1 ← SRC1[63:0];
    ELSE
        TEMP1 ← SRC1[127:64];
FI
IF (Imm8[4] = 0)
    THEN
        TEMP2 ← SRC2[63:0];
    ELSE
        TEMP2 ← SRC2[127:64];
FI
For i = 0 to 63 {
    TmpB[i] ← (TEMP1[0] and TEMP2[i]);
    For j = 1 to i {
        TmpB[i] ← TmpB[i] xor (TEMP1[j] and TEMP2[i-j])
    }
    DEST[i] ← TmpB[i];
}
For i = 64 to 126 {
    TmpB[i] ← 0;
    For j = i - 63 to 63 {
        TmpB[i] ← TmpB[i] xor (TEMP1[j] and TEMP2[i-j])
    }
    DEST[i] ← TmpB[i];
}
DEST[127] ← 0;
DEST[VLMAX-1:128] (Unmodified)

VPCLMULQDQ

IF (Imm8[0] = 0)
    THEN
        TEMP1 ← SRC1[63:0];
    ELSE
        TEMP1 ← SRC1[127:64];
FI
IF (Imm8[4] = 0)
    THEN
        TEMP2 ← SRC2[63:0];
    ELSE
        TEMP2 ← SRC2[127:64];
FI
For i = 0 to 63 {
    TmpB[i] ← (TEMP1[0] and TEMP2[i]);
    For j = 1 to i {
        TmpB[i] ← TmpB[i] xor (TEMP1[j] and TEMP2[i-j])
    }
    DEST[i] ← TmpB[i];
}
For i = 64 to 126 {
    TmpB[i] ← 0;
    For j = i - 63 to 63 {
        TmpB[i] ← TmpB[i] xor (TEMP1[j] and TEMP2[i-j])
    }
    DEST[i] ← TmpB[i];
}
DEST[VLMAX-1:127] ← 0;
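
The bit-by-bit loops above compute a polynomial product over GF(2): bit i of the result is the XOR of TEMP1[j] AND TEMP2[i-j] over all valid j. A minimal Python sketch of the same computation follows. The names `clmul64` and `pclmulqdq`, and the use of Python integers as 128-bit registers, are assumptions made for illustration; the shift-and-XOR form is an equivalent rearrangement of the loops, not the manual's own formulation.

```python
def clmul64(a, b):
    """Carry-less product of two 64-bit integers (at most 127 significant bits)."""
    result = 0
    for j in range(64):
        if (a >> j) & 1:
            # XOR-ing in b shifted by j contributes b[i-j] to result bit i,
            # matching TEMP1[j] and TEMP2[i-j] in the pseudocode.
            result ^= b << j
    return result

def pclmulqdq(src1, src2, imm8):
    """Select one quadword from each 128-bit source, then multiply carry-lessly.

    imm8 bit 0 picks the SRC1 half and imm8 bit 4 picks the SRC2 half.
    """
    mask64 = (1 << 64) - 1
    temp1 = (src1 >> 64) & mask64 if imm8 & 0x01 else src1 & mask64
    temp2 = (src2 >> 64) & mask64 if imm8 & 0x10 else src2 & mask64
    return clmul64(temp1, temp2)

a = (3 << 64) | 5   # high quadword 3, low quadword 5
b = (7 << 64) | 2   # high quadword 7, low quadword 2
products = [pclmulqdq(a, b, imm) for imm in (0x00, 0x01, 0x10, 0x11)]
```

The four immediates correspond to the four pseudo-ops of Table 4-14. Because two degree-63 polynomials multiply to degree 126 at most, bit 127 of the result is always 0.
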

Intel C/C++ Compiler Intrinsic Equivalent 

(V)PCLMULQDQ: __m128i _mm_clmulepi64_si128 (__m128i, __m128i, const int)

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

See Exceptions Type 4; additionally
#UD If VEX.L = 1.

PCMPEQB/PCMPEQW/PCMPEQD - Compare Packed Data for Equal

| Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description |
|---|---|---|---|---|---|
| 0F 74 /r¹ | PCMPEQB mm, mm/m64 | RM | V/V | MMX | Compare packed bytes in mm/m64 and mm for equality. |
| 66 0F 74 /r | PCMPEQB xmm1, xmm2/m128 | RM | V/V | SSE2 | Compare packed bytes in xmm2/m128 and xmm1 for equality. |
| 0F 75 /r¹ | PCMPEQW mm, mm/m64 | RM | V/V | MMX | Compare packed words in mm/m64 and mm for equality. |
| 66 0F 75 /r | PCMPEQW xmm1, xmm2/m128 | RM | V/V | SSE2 | Compare packed words in xmm2/m128 and xmm1 for equality. |
| 0F 76 /r¹ | PCMPEQD mm, mm/m64 | RM | V/V | MMX | Compare packed doublewords in mm/m64 and mm for equality. |
| 66 0F 76 /r | PCMPEQD xmm1, xmm2/m128 | RM | V/V | SSE2 | Compare packed doublewords in xmm2/m128 and xmm1 for equality. |
| VEX.NDS.128.66.0F.WIG 74 /r | VPCMPEQB xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Compare packed bytes in xmm3/m128 and xmm2 for equality. |
| VEX.NDS.128.66.0F.WIG 75 /r | VPCMPEQW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Compare packed words in xmm3/m128 and xmm2 for equality. |
| VEX.NDS.128.66.0F.WIG 76 /r | VPCMPEQD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Compare packed doublewords in xmm3/m128 and xmm2 for equality. |
| VEX.NDS.256.66.0F.WIG 74 /r | VPCMPEQB ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Compare packed bytes in ymm3/m256 and ymm2 for equality. |
| VEX.NDS.256.66.0F.WIG 75 /r | VPCMPEQW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Compare packed words in ymm3/m256 and ymm2 for equality. |
| VEX.NDS.256.66.0F.WIG 76 /r | VPCMPEQD ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Compare packed doublewords in ymm3/m256 and ymm2 for equality. |
| EVEX.NDS.128.66.0F.W0 76 /r | VPCMPEQD k1 {k2}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Compare Equal between int32 vector xmm2 and int32 vector xmm3/m128/m32bcst, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |
| EVEX.NDS.256.66.0F.W0 76 /r | VPCMPEQD k1 {k2}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Compare Equal between int32 vector ymm2 and int32 vector ymm3/m256/m32bcst, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |
| EVEX.NDS.512.66.0F.W0 76 /r | VPCMPEQD k1 {k2}, zmm2, zmm3/m512/m32bcst | FV | V/V | AVX512F | Compare Equal between int32 vectors in zmm2 and zmm3/m512/m32bcst, and set destination k1 according to the comparison results under writemask k2. |
| EVEX.NDS.128.66.0F.WIG 74 /r | VPCMPEQB k1 {k2}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Compare packed bytes in xmm3/m128 and xmm2 for equality and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |
| EVEX.NDS.256.66.0F.WIG 74 /r | VPCMPEQB k1 {k2}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Compare packed bytes in ymm3/m256 and ymm2 for equality and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |
| EVEX.NDS.512.66.0F.WIG 74 /r | VPCMPEQB k1 {k2}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Compare packed bytes in zmm3/m512 and zmm2 for equality and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |
| EVEX.NDS.128.66.0F.WIG 75 /r | VPCMPEQW k1 {k2}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Compare packed words in xmm3/m128 and xmm2 for equality and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |
| EVEX.NDS.256.66.0F.WIG 75 /r | VPCMPEQW k1 {k2}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Compare packed words in ymm3/m256 and ymm2 for equality and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |
| EVEX.NDS.512.66.0F.WIG 75 /r | VPCMPEQW k1 {k2}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Compare packed words in zmm3/m512 and zmm2 for equality and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers"
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding 


| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA |
| RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA |
| FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA |
| FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA |


Description 

Performs a SIMD compare for equality of the packed bytes, words, or doublewords in the destination operand (first
operand) and the source operand (second operand). If a pair of data elements is equal, the corresponding data
element in the destination operand is set to all 1s; otherwise, it is set to all 0s.

The (V)PCMPEQB instruction compares the corresponding bytes in the destination and source operands; the
(V)PCMPEQW instruction compares the corresponding words in the destination and source operands; and the
(V)PCMPEQD instruction compares the corresponding doublewords in the destination and source operands.

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).

Legacy SSE instructions: The source operand can be an MMX technology register or a 64-bit memory location. The
destination operand can be an MMX technology register.

128-bit Legacy SSE version: The second source operand can be an XMM register or a 128-bit memory location. The
first source and destination operands are XMM registers. Bits (VLMAX-1:128) of the corresponding YMM destination
register remain unchanged.

VEX.128 encoded version: The second source operand can be an XMM register or a 128-bit memory location. The
first source and destination operands are XMM registers. Bits (VLMAX-1:128) of the corresponding YMM register
are zeroed.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register.

EVEX encoded VPCMPEQD: The first source operand (second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 32-bit memory location. The destination operand (first operand) is a mask register updated
according to the writemask k2.

EVEX encoded VPCMPEQB/W: The first source operand (second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand
(first operand) is a mask register updated according to the writemask k2.
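
For the non-EVEX forms, the all-1s / all-0s result described above can be modeled with a one-line comparison per element. The sketch below is illustrative only: `cmpeq` is a hypothetical name, and the operands are Python lists of unsigned element values with the element width passed explicitly.

```python
def cmpeq(src1, src2, width):
    """Element-wise equality for the vector-result forms.

    width is the element size in bits (8, 16 or 32). Equal elements produce
    an all-1s element of that width, unequal elements produce 0. The name and
    list representation are assumptions for this sketch.
    """
    ones = (1 << width) - 1
    return [ones if a == b else 0 for a, b in zip(src1, src2)]

dword_result = cmpeq([1, 2, 3, 4], [1, 0, 3, 5], 32)
byte_result = cmpeq([0x10, 0x20], [0x10, 0x21], 8)
```

An all-1s element is convenient as a bit mask for later AND/ANDN/OR selection, which is a common use of these compare results.
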

Operation 

PCMPEQB (with 64-bit operands)

IF DEST[7:0] = SRC[7:0]
    THEN DEST[7:0] ← FFH;
    ELSE DEST[7:0] ← 0; FI;
(* Continue comparison of 2nd through 7th bytes in DEST and SRC *)
IF DEST[63:56] = SRC[63:56]
    THEN DEST[63:56] ← FFH;
    ELSE DEST[63:56] ← 0; FI;

COMPARE_BYTES_EQUAL (SRC1, SRC2)

IF SRC1[7:0] = SRC2[7:0]
    THEN DEST[7:0] ← FFH;
    ELSE DEST[7:0] ← 0; FI;
(* Continue comparison of 2nd through 15th bytes in SRC1 and SRC2 *)
IF SRC1[127:120] = SRC2[127:120]
    THEN DEST[127:120] ← FFH;
    ELSE DEST[127:120] ← 0; FI;

COMPARE_WORDS_EQUAL (SRC1, SRC2)

IF SRC1[15:0] = SRC2[15:0]
    THEN DEST[15:0] ← FFFFH;
    ELSE DEST[15:0] ← 0; FI;
(* Continue comparison of 2nd through 7th 16-bit words in SRC1 and SRC2 *)
IF SRC1[127:112] = SRC2[127:112]
    THEN DEST[127:112] ← FFFFH;
    ELSE DEST[127:112] ← 0; FI;

COMPARE_DWORDS_EQUAL (SRC1, SRC2)

IF SRC1[31:0] = SRC2[31:0]
    THEN DEST[31:0] ← FFFFFFFFH;
    ELSE DEST[31:0] ← 0; FI;
(* Continue comparison of 2nd through 3rd 32-bit dwords in SRC1 and SRC2 *)
IF SRC1[127:96] = SRC2[127:96]
    THEN DEST[127:96] ← FFFFFFFFH;
    ELSE DEST[127:96] ← 0; FI;

PCMPEQB (with 128-bit operands)

DEST[127:0] ← COMPARE_BYTES_EQUAL(DEST[127:0], SRC[127:0])
DEST[MAX_VL-1:128] (Unmodified)

VPCMPEQB (VEX.128 encoded version)

DEST[127:0] ← COMPARE_BYTES_EQUAL(SRC1[127:0], SRC2[127:0])
DEST[VLMAX-1:128] ← 0

VPCMPEQB (VEX.256 encoded version)

DEST[127:0] ← COMPARE_BYTES_EQUAL(SRC1[127:0], SRC2[127:0])
DEST[255:128] ← COMPARE_BYTES_EQUAL(SRC1[255:128], SRC2[255:128])
DEST[VLMAX-1:256] ← 0

VPCMPEQB (EVEX encoded versions)

(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k2[j] OR *no writemask*
        THEN
            /* signed comparison */
            CMP ← SRC1[i+7:i] == SRC2[i+7:i];
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0

PCMPEQW (with 64-bit operands)

IF DEST[15:0] = SRC[15:0]
    THEN DEST[15:0] ← FFFFH;
    ELSE DEST[15:0] ← 0; FI;
(* Continue comparison of 2nd and 3rd words in DEST and SRC *)
IF DEST[63:48] = SRC[63:48]
    THEN DEST[63:48] ← FFFFH;
    ELSE DEST[63:48] ← 0; FI;

PCMPEQW (with 128-bit operands)

DEST[127:0] ← COMPARE_WORDS_EQUAL(DEST[127:0], SRC[127:0])
DEST[MAX_VL-1:128] (Unmodified)

VPCMPEQW (VEX.128 encoded version)

DEST[127:0] ← COMPARE_WORDS_EQUAL(SRC1[127:0], SRC2[127:0])
DEST[VLMAX-1:128] ← 0

VPCMPEQW (VEX.256 encoded version)

DEST[127:0] ← COMPARE_WORDS_EQUAL(SRC1[127:0], SRC2[127:0])
DEST[255:128] ← COMPARE_WORDS_EQUAL(SRC1[255:128], SRC2[255:128])
DEST[VLMAX-1:256] ← 0

VPCMPEQW (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k2[j] OR *no writemask*
        THEN
            /* signed comparison */
            CMP ← SRC1[i+15:i] == SRC2[i+15:i];
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0

PCMPEQD (with 64-bit operands)

IF DEST[31:0] = SRC[31:0]
    THEN DEST[31:0] ← FFFFFFFFH;
    ELSE DEST[31:0] ← 0; FI;
IF DEST[63:32] = SRC[63:32]
    THEN DEST[63:32] ← FFFFFFFFH;
    ELSE DEST[63:32] ← 0; FI;

PCMPEQD (with 128-bit operands)

DEST[127:0] ← COMPARE_DWORDS_EQUAL(DEST[127:0], SRC[127:0])
DEST[MAX_VL-1:128] (Unmodified)

VPCMPEQD (VEX.128 encoded version)

DEST[127:0] ← COMPARE_DWORDS_EQUAL(SRC1[127:0], SRC2[127:0])
DEST[VLMAX-1:128] ← 0

VPCMPEQD (VEX.256 encoded version)

DEST[127:0] ← COMPARE_DWORDS_EQUAL(SRC1[127:0], SRC2[127:0])
DEST[255:128] ← COMPARE_DWORDS_EQUAL(SRC1[255:128], SRC2[255:128])
DEST[VLMAX-1:256] ← 0

VPCMPEQD (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k2[j] OR *no writemask*
        THEN
            /* signed comparison */
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN CMP ← SRC1[i+31:i] = SRC2[31:0];
                ELSE CMP ← SRC1[i+31:i] = SRC2[i+31:i];
            FI;
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0
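
The EVEX form writes one bit per element into a mask register instead of a vector of all-1s elements. The following Python sketch models the VPCMPEQD loop, including the writemask and the 32-bit broadcast case. The name `vpcmpeqd_mask`, the `broadcast` flag (standing in for EVEX.b = 1 with a memory source), and the use of an integer for k1/k2 are assumptions made for this example.

```python
def vpcmpeqd_mask(src1, src2, k2=None, broadcast=False):
    """Return the k1 result mask as an integer.

    src1 is a list of 4, 8 or 16 dwords. With broadcast=True, src2[0] is
    compared against every element of src1 (the m32bcst case); otherwise
    src2 is compared element by element. k2=None means no writemask.
    """
    k1 = 0
    for j, a in enumerate(src1):
        if k2 is None or (k2 >> j) & 1:
            b = src2[0] if broadcast else src2[j]
            if a == b:
                k1 |= 1 << j
        # Elements masked off by k2 leave their k1 bit at 0 (zeroing only).
    return k1

unmasked = vpcmpeqd_mask([5, 6, 5, 5], [5, 5, 5, 5])
masked = vpcmpeqd_mask([5, 6, 5, 5], [5, 5, 5, 5], k2=0b0110)
bcast = vpcmpeqd_mask([5, 6, 5, 5], [5], broadcast=True)
```

Bits above KL-1 are never set by the loop, which corresponds to the final DEST[MAX_KL-1:KL] ← 0 step.
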

Intel C/C++ Compiler Intrinsic Equivalents

VPCMPEQB __mmask64 _mm512_cmpeq_epi8_mask(__m512i a, __m512i b);
VPCMPEQB __mmask64 _mm512_mask_cmpeq_epi8_mask(__mmask64 k, __m512i a, __m512i b);
VPCMPEQB __mmask32 _mm256_cmpeq_epi8_mask(__m256i a, __m256i b);
VPCMPEQB __mmask32 _mm256_mask_cmpeq_epi8_mask(__mmask32 k, __m256i a, __m256i b);
VPCMPEQB __mmask16 _mm_cmpeq_epi8_mask(__m128i a, __m128i b);
VPCMPEQB __mmask16 _mm_mask_cmpeq_epi8_mask(__mmask16 k, __m128i a, __m128i b);
VPCMPEQW __mmask32 _mm512_cmpeq_epi16_mask(__m512i a, __m512i b);
VPCMPEQW __mmask32 _mm512_mask_cmpeq_epi16_mask(__mmask32 k, __m512i a, __m512i b);
VPCMPEQW __mmask16 _mm256_cmpeq_epi16_mask(__m256i a, __m256i b);
VPCMPEQW __mmask16 _mm256_mask_cmpeq_epi16_mask(__mmask16 k, __m256i a, __m256i b);
VPCMPEQW __mmask8 _mm_cmpeq_epi16_mask(__m128i a, __m128i b);
VPCMPEQW __mmask8 _mm_mask_cmpeq_epi16_mask(__mmask8 k, __m128i a, __m128i b);
VPCMPEQD __mmask16 _mm512_cmpeq_epi32_mask(__m512i a, __m512i b);
VPCMPEQD __mmask16 _mm512_mask_cmpeq_epi32_mask(__mmask16 k, __m512i a, __m512i b);
VPCMPEQD __mmask8 _mm256_cmpeq_epi32_mask(__m256i a, __m256i b);
VPCMPEQD __mmask8 _mm256_mask_cmpeq_epi32_mask(__mmask8 k, __m256i a, __m256i b);
VPCMPEQD __mmask8 _mm_cmpeq_epi32_mask(__m128i a, __m128i b);
VPCMPEQD __mmask8 _mm_mask_cmpeq_epi32_mask(__mmask8 k, __m128i a, __m128i b);
PCMPEQB: __m64 _mm_cmpeq_pi8 (__m64 m1, __m64 m2)
PCMPEQW: __m64 _mm_cmpeq_pi16 (__m64 m1, __m64 m2)
PCMPEQD: __m64 _mm_cmpeq_pi32 (__m64 m1, __m64 m2)
(V)PCMPEQB: __m128i _mm_cmpeq_epi8 (__m128i a, __m128i b)
(V)PCMPEQW: __m128i _mm_cmpeq_epi16 (__m128i a, __m128i b)
(V)PCMPEQD: __m128i _mm_cmpeq_epi32 (__m128i a, __m128i b)
VPCMPEQB: __m256i _mm256_cmpeq_epi8 (__m256i a, __m256i b)
VPCMPEQW: __m256i _mm256_cmpeq_epi16 (__m256i a, __m256i b)
VPCMPEQD: __m256i _mm256_cmpeq_epi32 (__m256i a, __m256i b)


Flags Affected 

None. 

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 
EVEX-encoded VPCMPEQD, see Exceptions Type E4. 
EVEX-encoded VPCMPEQB/W, see Exceptions Type E4.nb.

PCMPEQQ — Compare Packed Qword Data for Equal

| Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description |
|---|---|---|---|---|---|
| 66 0F 38 29 /r | PCMPEQQ xmm1, xmm2/m128 | RM | V/V | SSE4_1 | Compare packed qwords in xmm2/m128 and xmm1 for equality. |
| VEX.NDS.128.66.0F38.WIG 29 /r | VPCMPEQQ xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Compare packed quadwords in xmm3/m128 and xmm2 for equality. |
| VEX.NDS.256.66.0F38.WIG 29 /r | VPCMPEQQ ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Compare packed quadwords in ymm3/m256 and ymm2 for equality. |
| EVEX.NDS.128.66.0F38.W1 29 /r | VPCMPEQQ k1 {k2}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Compare Equal between int64 vector xmm2 and int64 vector xmm3/m128/m64bcst, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |
| EVEX.NDS.256.66.0F38.W1 29 /r | VPCMPEQQ k1 {k2}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Compare Equal between int64 vector ymm2 and int64 vector ymm3/m256/m64bcst, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |
| EVEX.NDS.512.66.0F38.W1 29 /r | VPCMPEQQ k1 {k2}, zmm2, zmm3/m512/m64bcst | FV | V/V | AVX512F | Compare Equal between int64 vector zmm2 and int64 vector zmm3/m512/m64bcst, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |


Instruction Operand Encoding 


| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA |
| RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA |
| FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA |


Description 

Performs an SIMD compare for equality of the packed quadwords in the destination operand (first operand) and the
source operand (second operand). If a pair of data elements is equal, the corresponding data element in the
destination is set to all 1s; otherwise, it is set to 0s.

128-bit Legacy SSE version: The second source operand can be an XMM register or a 128-bit memory location. The
first source and destination operands are XMM registers. Bits (VLMAX-1:128) of the corresponding YMM destination
register remain unchanged.

VEX.128 encoded version: The second source operand can be an XMM register or a 128-bit memory location. The
first source and destination operands are XMM registers. Bits (VLMAX-1:128) of the corresponding YMM register
are zeroed.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register.

EVEX encoded VPCMPEQQ: The first source operand (second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand (first operand) is a mask register updated
according to the writemask k2.

Operation

PCMPEQQ (with 128-bit operands)
IF (DEST[63:0] = SRC[63:0])
    THEN DEST[63:0] ← FFFFFFFFFFFFFFFFH;
    ELSE DEST[63:0] ← 0; FI;
IF (DEST[127:64] = SRC[127:64])
    THEN DEST[127:64] ← FFFFFFFFFFFFFFFFH;
    ELSE DEST[127:64] ← 0; FI;
DEST[MAX_VL-1:128] (Unmodified)

COMPARE_QWORDS_EQUAL (SRC1, SRC2)
    IF SRC1[63:0] = SRC2[63:0]
        THEN DEST[63:0] ← FFFFFFFFFFFFFFFFH;
        ELSE DEST[63:0] ← 0; FI;
    IF SRC1[127:64] = SRC2[127:64]
        THEN DEST[127:64] ← FFFFFFFFFFFFFFFFH;
        ELSE DEST[127:64] ← 0; FI;
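
The COMPARE_QWORDS_EQUAL helper can be modeled in portable C. The following is a scalar sketch for illustration only, not the intrinsic itself; the `vec128` type and the function name `compare_qwords_equal` are hypothetical names introduced for this example.

```c
#include <stdint.h>

/* Hypothetical 128-bit vector modeled as two 64-bit lanes
   (lane 0 holds bits 63:0, lane 1 holds bits 127:64). */
typedef struct { uint64_t q[2]; } vec128;

/* Scalar model of COMPARE_QWORDS_EQUAL: each destination lane becomes
   all 1s when the two source lanes are equal and all 0s otherwise. */
static vec128 compare_qwords_equal(vec128 src1, vec128 src2)
{
    vec128 dest;
    for (int lane = 0; lane < 2; lane++) {
        dest.q[lane] = (src1.q[lane] == src2.q[lane]) ? UINT64_MAX : 0;
    }
    return dest;
}
```

The all-1s/all-0s lane result is what allows the output to be used directly as a bitwise select mask in later vector operations.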


VPCMPEQQ (VEX.128 encoded version)
DEST[127:0] ← COMPARE_QWORDS_EQUAL(SRC1,SRC2)
DEST[VLMAX-1:128] ← 0

VPCMPEQQ (VEX.256 encoded version)
DEST[127:0] ← COMPARE_QWORDS_EQUAL(SRC1[127:0],SRC2[127:0])
DEST[255:128] ← COMPARE_QWORDS_EQUAL(SRC1[255:128],SRC2[255:128])
DEST[VLMAX-1:256] ← 0

VPCMPEQQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k2[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN CMP ← SRC1[i+63:i] = SRC2[63:0];
                ELSE CMP ← SRC1[i+63:i] = SRC2[i+63:i];
            FI;
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0
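
The EVEX loop can be traced with a scalar model. The sketch below is illustrative only and assumes an 8-bit mask is wide enough (KL is at most 8 for qwords); the function name `vpcmpeqq_mask_model` and its parameters are hypothetical and not part of any compiler header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Scalar model of the EVEX-encoded VPCMPEQQ loop.
   kl            : number of 64-bit elements (2, 4, or 8)
   k2            : writemask bits, ignored when use_writemask is false
   broadcast     : models (EVEX.b = 1) AND (SRC2 is memory); src2[0] is
                   then compared against every element of src1
   Returns the value written to k1; bits at and above kl are zero. */
static uint8_t vpcmpeqq_mask_model(const uint64_t *src1, const uint64_t *src2,
                                   int kl, uint8_t k2, bool use_writemask,
                                   bool broadcast)
{
    uint8_t dest = 0;
    for (int j = 0; j < kl; j++) {
        bool selected = !use_writemask || ((k2 >> j) & 1);
        if (selected) {
            uint64_t rhs = broadcast ? src2[0] : src2[j];
            if (src1[j] == rhs) {
                dest |= (uint8_t)(1u << j);
            }
        }
        /* Elements not selected by the writemask stay 0 (zeroing-masking). */
    }
    return dest;
}
```

Note that the destination is a mask register, so only zeroing-masking applies: an element excluded by k2 always yields a 0 bit.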

Intel C/C++ Compiler Intrinsic Equivalent

VPCMPEQQ __mmask8 _mm512_cmpeq_epi64_mask(__m512i a, __m512i b);
VPCMPEQQ __mmask8 _mm512_mask_cmpeq_epi64_mask(__mmask8 k, __m512i a, __m512i b);
VPCMPEQQ __mmask8 _mm256_cmpeq_epi64_mask(__m256i a, __m256i b);
VPCMPEQQ __mmask8 _mm256_mask_cmpeq_epi64_mask(__mmask8 k, __m256i a, __m256i b);
VPCMPEQQ __mmask8 _mm_cmpeq_epi64_mask(__m128i a, __m128i b);
VPCMPEQQ __mmask8 _mm_mask_cmpeq_epi64_mask(__mmask8 k, __m128i a, __m128i b);
(V)PCMPEQQ: __m128i _mm_cmpeq_epi64(__m128i a, __m128i b);




VPCMPEQQ: __m256i _mm256_cmpeq_epi64(__m256i a, __m256i b);

Flags Affected 

None. 

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 
EVEX-encoded VPCMPEQQ, see Exceptions Type E4.

PCMPESTRI — Packed Compare Explicit Length Strings, Return Index

| Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description |
|---|---|---|---|---|---|
| 66 0F 3A 61 /r imm8 | PCMPESTRI xmm1, xmm2/m128, imm8 | RMI | V/V | SSE4_2 | Perform a packed comparison of string data with explicit lengths, generating an index, and storing the result in ECX. |
| VEX.128.66.0F3A 61 /r ib | VPCMPESTRI xmm1, xmm2/m128, imm8 | RMI | V/V | AVX | Perform a packed comparison of string data with explicit lengths, generating an index, and storing the result in ECX. |


Instruction Operand Encoding 


| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| RMI | ModRM:reg (r) | ModRM:r/m (r) | imm8 | NA |


Description 

The instruction compares and processes data from two string fragments based on the encoded value in the Imm8
Control Byte (see Section 4.1, "Imm8 Control Byte Operation for PCMPESTRI / PCMPESTRM / PCMPISTRI /
PCMPISTRM"), and generates an index stored to the count register (ECX).

Each string fragment is represented by two values. The first value is an xmm (or possibly m128 for the second
operand) which contains the data elements of the string (byte or word data). The second value is stored in an input
length register. The input length register is EAX/RAX (for xmm1) or EDX/RDX (for xmm2/m128). The length
represents the number of bytes/words which are valid for the respective xmm/m128 data.

The length of each input is interpreted as being the absolute-value of the value in the length register. The absolute- 
value computation saturates to 16 (for bytes) and 8 (for words), based on the value of imm8[bit3] when the value 
in the length register is greater than 16 (8) or less than -16 (-8). 
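
The length interpretation can be written as a small helper. This is a minimal sketch under stated assumptions: the function name `saturated_length` is hypothetical, and the element count (16 for bytes, 8 for words) is passed in by the caller rather than decoded from imm8.

```c
#include <stdint.h>

/* Scalar model of the explicit-length rule: take the absolute value of
   the length register and saturate it to max_elems (16 for byte data,
   8 for word data). */
static int saturated_length(int64_t len_reg, int max_elems)
{
    /* Test the range before negating so that INT64_MIN cannot overflow. */
    if (len_reg > max_elems || len_reg < -(int64_t)max_elems) {
        return max_elems;
    }
    return (int)(len_reg < 0 ? -len_reg : len_reg);
}
```

For example, a length register holding -3 describes a three-element fragment, and any magnitude beyond the register width in elements is treated as a full register.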

The comparison and aggregation operations are performed according to the encoded value of Imm8 bit fields (see 
Section 4.1). The index of the first (or last, according to imm8[6]) set bit of IntRes2 (see Section 4.1.4) is returned 
in ECX. If no bits are set in IntRes2, ECX is set to 16 (8). 
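
The index selection can be sketched in scalar C. The name `index_from_intres2` and the `select_last` flag are hypothetical; `select_last` stands in for the first/last choice that imm8[6] controls, and `max_elems` is 16 for bytes or 8 for words.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns the index of the first (or last) set bit among the low
   max_elems bits of int_res2, or max_elems when no bit is set. */
static int index_from_intres2(uint16_t int_res2, int max_elems, bool select_last)
{
    if (select_last) {
        for (int i = max_elems - 1; i >= 0; i--) {
            if ((int_res2 >> i) & 1) {
                return i;
            }
        }
    } else {
        for (int i = 0; i < max_elems; i++) {
            if ((int_res2 >> i) & 1) {
                return i;
            }
        }
    }
    return max_elems; /* no match: 16 for bytes, 8 for words */
}
```

Because the "no match" value equals the element count, a caller can advance a string pointer by the returned index without a separate branch for the empty case.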

Note that the Arithmetic Flags are written in a non-standard manner in order to supply the most relevant
information:

CFlag - Reset if IntRes2 is equal to zero, set otherwise
ZFlag - Set if absolute-value of EDX is < 16 (8), reset otherwise
SFlag - Set if absolute-value of EAX is < 16 (8), reset otherwise
OFlag - IntRes2[0]
AFlag - Reset
PFlag - Reset
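
The flag rules listed above can be collected into one function. This is an illustrative sketch only: the `string_flags` struct and the function `pcmpestr_flags` are hypothetical names, and the two lengths are assumed to be already converted to absolute values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the six arithmetic flags. */
typedef struct {
    bool cf, zf, sf, of, af, pf;
} string_flags;

/* Scalar model of the overloaded flag results.
   abs_eax, abs_edx : absolute values of the two length registers
   max_elems        : 16 for byte data, 8 for word data */
static string_flags pcmpestr_flags(uint16_t int_res2, int abs_eax, int abs_edx,
                                   int max_elems)
{
    string_flags f;
    f.cf = (int_res2 != 0);        /* any result bit set */
    f.zf = (abs_edx < max_elems);  /* second fragment is shorter than a register */
    f.sf = (abs_eax < max_elems);  /* first fragment is shorter than a register */
    f.of = (int_res2 & 1u) != 0;   /* IntRes2[0] */
    f.af = false;
    f.pf = false;
    return f;
}
```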


Effective Operand Size 


| Operating mode/size | Operand 1 | Operand 2 | Length 1 | Length 2 | Result |
|---|---|---|---|---|---|
| 16 bit | xmm | xmm/m128 | EAX | EDX | ECX |
| 32 bit | xmm | xmm/m128 | EAX | EDX | ECX |
| 64 bit | xmm | xmm/m128 | EAX | EDX | ECX |
| 64 bit + REX.W | xmm | xmm/m128 | RAX | RDX | ECX |


Intel C/C++ Compiler Intrinsic Equivalent For Returning Index

int _mm_cmpestri (__m128i a, int la, __m128i b, int lb, const int mode);

Intel C/C++ Compiler Intrinsics For Reading EFlag Results

int _mm_cmpestra (__m128i a, int la, __m128i b, int lb, const int mode);
int _mm_cmpestrc (__m128i a, int la, __m128i b, int lb, const int mode);
int _mm_cmpestro (__m128i a, int la, __m128i b, int lb, const int mode);
int _mm_cmpestrs (__m128i a, int la, __m128i b, int lb, const int mode);
int _mm_cmpestrz (__m128i a, int la, __m128i b, int lb, const int mode);

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

See Exceptions Type 4; additionally, this instruction does not cause #GP if the memory operand is not aligned to a
16-byte boundary, and

#UD If VEX.L = 1.

If VEX.vvvv ≠ 1111B.

PCMPESTRM — Packed Compare Explicit Length Strings, Return Mask

| Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description |
|---|---|---|---|---|---|
| 66 0F 3A 60 /r imm8 | PCMPESTRM xmm1, xmm2/m128, imm8 | RMI | V/V | SSE4_2 | Perform a packed comparison of string data with explicit lengths, generating a mask, and storing the result in XMM0. |
| VEX.128.66.0F3A 60 /r ib | VPCMPESTRM xmm1, xmm2/m128, imm8 | RMI | V/V | AVX | Perform a packed comparison of string data with explicit lengths, generating a mask, and storing the result in XMM0. |


Instruction Operand Encoding 


| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| RMI | ModRM:reg (r) | ModRM:r/m (r) | imm8 | NA |


Description 

The instruction compares data from two string fragments based on the encoded value in the Imm8 Control Byte (see
Section 4.1, "Imm8 Control Byte Operation for PCMPESTRI / PCMPESTRM / PCMPISTRI / PCMPISTRM"), and
generates a mask stored to XMM0.

Each string fragment is represented by two values. The first value is an xmm (or possibly m128 for the second
operand) which contains the data elements of the string (byte or word data). The second value is stored in an input
length register. The input length register is EAX/RAX (for xmm1) or EDX/RDX (for xmm2/m128). The length
represents the number of bytes/words which are valid for the respective xmm/m128 data.

The length of each input is interpreted as being the absolute-value of the value in the length register. The absolute- 
value computation saturates to 16 (for bytes) and 8 (for words), based on the value of imm8[bit3] when the value 
in the length register is greater than 16 (8) or less than -16 (-8). 

The comparison and aggregation operations are performed according to the encoded value of Imm8 bit fields (see
Section 4.1). As defined by imm8[6], IntRes2 is then either stored to the least significant bits of XMM0 (zero
extended to 128 bits) or expanded into a byte/word-mask and then stored to XMM0.
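
The two output forms can be compared side by side with a scalar model. This is a sketch under stated assumptions: the function name `intres2_to_xmm0` is hypothetical, XMM0 is modeled as 16 bytes with `out[0]` least significant, and the `expand` flag stands in for the choice that imm8[6] makes.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of the PCMPESTRM result.
   words  : true for word data (8 elements), false for byte data (16)
   expand : false stores IntRes2 as a bit mask in the low bits,
            true expands each bit into a full byte/word of all 1s or 0s */
static void intres2_to_xmm0(uint16_t int_res2, bool words, bool expand,
                            uint8_t out[16])
{
    int elems = words ? 8 : 16;
    int width = words ? 2 : 1;

    if (words) {
        int_res2 &= 0x00FF; /* only 8 result bits are meaningful for words */
    }
    memset(out, 0, 16);

    if (!expand) {
        /* Bit mask form: zero-extended to 128 bits. */
        out[0] = (uint8_t)(int_res2 & 0xFF);
        out[1] = (uint8_t)(int_res2 >> 8);
        return;
    }
    for (int i = 0; i < elems; i++) {
        if ((int_res2 >> i) & 1) {
            memset(out + i * width, 0xFF, (size_t)width);
        }
    }
}
```

The expanded form is convenient when the result feeds a blend or bitwise AND; the bit mask form is more compact when the result is moved to a general-purpose register.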

Note that the Arithmetic Flags are written in a non-standard manner in order to supply the most relevant
information:

CFlag - Reset if IntRes2 is equal to zero, set otherwise
ZFlag - Set if absolute-value of EDX is < 16 (8), reset otherwise
SFlag - Set if absolute-value of EAX is < 16 (8), reset otherwise
OFlag - IntRes2[0]
AFlag - Reset
PFlag - Reset


Note: In VEX.128 encoded versions, bits (VLMAX-1:128) of XMM0 are zeroed. VEX.vvvv is reserved and must be
1111b, VEX.L must be 0, otherwise the instruction will #UD.

Effective Operand Size

| Operating mode/size | Operand 1 | Operand 2 | Length 1 | Length 2 | Result |
|---|---|---|---|---|---|
| 16 bit | xmm | xmm/m128 | EAX | EDX | XMM0 |
| 32 bit | xmm | xmm/m128 | EAX | EDX | XMM0 |
| 64 bit | xmm | xmm/m128 | EAX | EDX | XMM0 |
| 64 bit + REX.W | xmm | xmm/m128 | RAX | RDX | XMM0 |


Intel C/C++ Compiler Intrinsic Equivalent For Returning Mask

__m128i _mm_cmpestrm (__m128i a, int la, __m128i b, int lb, const int mode);

Intel C/C++ Compiler Intrinsics For Reading EFlag Results

int _mm_cmpestra (__m128i a, int la, __m128i b, int lb, const int mode);
int _mm_cmpestrc (__m128i a, int la, __m128i b, int lb, const int mode);
int _mm_cmpestro (__m128i a, int la, __m128i b, int lb, const int mode);
int _mm_cmpestrs (__m128i a, int la, __m128i b, int lb, const int mode);
int _mm_cmpestrz (__m128i a, int la, __m128i b, int lb, const int mode);

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

See Exceptions Type 4; additionally, this instruction does not cause #GP if the memory operand is not aligned to a
16-byte boundary, and

#UD If VEX.L = 1.

If VEX.vvvv ≠ 1111B.

PCMPGTB/PCMPGTW/PCMPGTD—Compare Packed Signed Integers for Greater Than

| Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description |
|---|---|---|---|---|---|
| 0F 64 /r¹ | PCMPGTB mm, mm/m64 | RM | V/V | MMX | Compare packed signed byte integers in mm and mm/m64 for greater than. |
| 66 0F 64 /r | PCMPGTB xmm1, xmm2/m128 | RM | V/V | SSE2 | Compare packed signed byte integers in xmm1 and xmm2/m128 for greater than. |
| 0F 65 /r¹ | PCMPGTW mm, mm/m64 | RM | V/V | MMX | Compare packed signed word integers in mm and mm/m64 for greater than. |
| 66 0F 65 /r | PCMPGTW xmm1, xmm2/m128 | RM | V/V | SSE2 | Compare packed signed word integers in xmm1 and xmm2/m128 for greater than. |
| 0F 66 /r¹ | PCMPGTD mm, mm/m64 | RM | V/V | MMX | Compare packed signed doubleword integers in mm and mm/m64 for greater than. |
| 66 0F 66 /r | PCMPGTD xmm1, xmm2/m128 | RM | V/V | SSE2 | Compare packed signed doubleword integers in xmm1 and xmm2/m128 for greater than. |
| VEX.NDS.128.66.0F.WIG 64 /r | VPCMPGTB xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Compare packed signed byte integers in xmm2 and xmm3/m128 for greater than. |
| VEX.NDS.128.66.0F.WIG 65 /r | VPCMPGTW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Compare packed signed word integers in xmm2 and xmm3/m128 for greater than. |
| VEX.NDS.128.66.0F.WIG 66 /r | VPCMPGTD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Compare packed signed doubleword integers in xmm2 and xmm3/m128 for greater than. |
| VEX.NDS.256.66.0F.WIG 64 /r | VPCMPGTB ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Compare packed signed byte integers in ymm2 and ymm3/m256 for greater than. |
| VEX.NDS.256.66.0F.WIG 65 /r | VPCMPGTW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Compare packed signed word integers in ymm2 and ymm3/m256 for greater than. |
| VEX.NDS.256.66.0F.WIG 66 /r | VPCMPGTD ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Compare packed signed doubleword integers in ymm2 and ymm3/m256 for greater than. |
| EVEX.NDS.128.66.0F.W0 66 /r | VPCMPGTD k1 {k2}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Compare Greater between int32 vector xmm2 and int32 vector xmm3/m128/m32bcst, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |
| EVEX.NDS.256.66.0F.W0 66 /r | VPCMPGTD k1 {k2}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Compare Greater between int32 vector ymm2 and int32 vector ymm3/m256/m32bcst, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |
| EVEX.NDS.512.66.0F.W0 66 /r | VPCMPGTD k1 {k2}, zmm2, zmm3/m512/m32bcst | FV | V/V | AVX512F | Compare Greater between int32 elements in zmm2 and zmm3/m512/m32bcst, and set destination k1 according to the comparison results under writemask k2. |
| EVEX.NDS.128.66.0F.WIG 64 /r | VPCMPGTB k1 {k2}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Compare packed signed byte integers in xmm2 and xmm3/m128 for greater than, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |
| EVEX.NDS.256.66.0F.WIG 64 /r | VPCMPGTB k1 {k2}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Compare packed signed byte integers in ymm2 and ymm3/m256 for greater than, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |
| EVEX.NDS.512.66.0F.WIG 64 /r | VPCMPGTB k1 {k2}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Compare packed signed byte integers in zmm2 and zmm3/m512 for greater than, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |
| EVEX.NDS.128.66.0F.WIG 65 /r | VPCMPGTW k1 {k2}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Compare packed signed word integers in xmm2 and xmm3/m128 for greater than, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |
| EVEX.NDS.256.66.0F.WIG 65 /r | VPCMPGTW k1 {k2}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Compare packed signed word integers in ymm2 and ymm3/m256 for greater than, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |
| EVEX.NDS.512.66.0F.WIG 65 /r | VPCMPGTW k1 {k2}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Compare packed signed word integers in zmm2 and zmm3/m512 for greater than, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers"
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding 


| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA |
| RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA |
| FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA |
| FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA |


Description 

Performs an SIMD signed compare for the greater value of the packed byte, word, or doubleword integers in the
destination operand (first operand) and the source operand (second operand). If a data element in the destination
operand is greater than the corresponding data element in the source operand, the corresponding data element in
the destination operand is set to all 1s; otherwise, it is set to all 0s.
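
Because the comparison is signed, a byte such as FFH counts as -1 and is not greater than 01H. The scalar sketch below illustrates this for byte elements; the function name `compare_bytes_greater_model` and its array-based interface are hypothetical and chosen only for this example.

```c
#include <stdint.h>

/* Scalar model of a packed signed byte greater-than compare.
   Each result byte is FFH when a[i] > b[i] (signed) and 00H otherwise. */
static void compare_bytes_greater_model(const int8_t *a, const int8_t *b,
                                        uint8_t *result, int count)
{
    for (int i = 0; i < count; i++) {
        result[i] = (a[i] > b[i]) ? 0xFF : 0x00;
    }
}
```

Code that needs an unsigned comparison has to account for this, since values with the top bit set order below all non-negative values here.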

The PCMPGTB instruction compares the corresponding signed byte integers in the destination and source
operands; the PCMPGTW instruction compares the corresponding signed word integers in the destination and source
operands; and the PCMPGTD instruction compares the corresponding signed doubleword integers in the
destination and source operands.

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).

Legacy SSE instructions: The source operand can be an MMX technology register or a 64-bit memory location. The
destination operand can be an MMX technology register.

128-bit Legacy SSE version: The second source operand can be an XMM register or a 128-bit memory location. The
first source operand and destination operand are XMM registers. Bits (VLMAX-1:128) of the corresponding YMM
destination register remain unchanged.

VEX.128 encoded version: The second source operand can be an XMM register or a 128-bit memory location. The
first source operand and destination operand are XMM registers. Bits (VLMAX-1:128) of the corresponding YMM
register are zeroed.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register.

EVEX encoded VPCMPGTD: The first source operand (second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 32-bit memory location. The destination operand (first operand) is a mask register updated
according to the writemask k2.

EVEX encoded VPCMPGTB/W: The first source operand (second operand) is a ZMM/YMM/XMM register. The second 
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location. The destination operand 
(first operand) is a mask register updated according to the writemask k2. 

Operation

PCMPGTB (with 64-bit operands)
    IF DEST[7:0] > SRC[7:0]
        THEN DEST[7:0] ← FFH;
        ELSE DEST[7:0] ← 0; FI;
    (* Continue comparison of 2nd through 7th bytes in DEST and SRC *)
    IF DEST[63:56] > SRC[63:56]
        THEN DEST[63:56] ← FFH;
        ELSE DEST[63:56] ← 0; FI;

COMPARE_BYTES_GREATER (SRC1, SRC2)
    IF SRC1[7:0] > SRC2[7:0]
        THEN DEST[7:0] ← FFH;
        ELSE DEST[7:0] ← 0; FI;
    (* Continue comparison of 2nd through 15th bytes in SRC1 and SRC2 *)
    IF SRC1[127:120] > SRC2[127:120]
        THEN DEST[127:120] ← FFH;
        ELSE DEST[127:120] ← 0; FI;

COMPARE_WORDS_GREATER (SRC1, SRC2)
    IF SRC1[15:0] > SRC2[15:0]
        THEN DEST[15:0] ← FFFFH;
        ELSE DEST[15:0] ← 0; FI;
    (* Continue comparison of 2nd through 7th 16-bit words in SRC1 and SRC2 *)
    IF SRC1[127:112] > SRC2[127:112]
        THEN DEST[127:112] ← FFFFH;
        ELSE DEST[127:112] ← 0; FI;

COMPARE_DWORDS_GREATER (SRC1, SRC2)
    IF SRC1[31:0] > SRC2[31:0]
        THEN DEST[31:0] ← FFFFFFFFH;
        ELSE DEST[31:0] ← 0; FI;
    (* Continue comparison of 2nd through 3rd 32-bit dwords in SRC1 and SRC2 *)
    IF SRC1[127:96] > SRC2[127:96]
        THEN DEST[127:96] ← FFFFFFFFH;
        ELSE DEST[127:96] ← 0; FI;


PCMPGTB (with 128-bit operands)
DEST[127:0] ← COMPARE_BYTES_GREATER(DEST[127:0],SRC[127:0])
DEST[MAX_VL-1:128] (Unmodified)

VPCMPGTB (VEX.128 encoded version)
DEST[127:0] ← COMPARE_BYTES_GREATER(SRC1,SRC2)
DEST[VLMAX-1:128] ← 0

VPCMPGTB (VEX.256 encoded version)
DEST[127:0] ← COMPARE_BYTES_GREATER(SRC1[127:0],SRC2[127:0])
DEST[255:128] ← COMPARE_BYTES_GREATER(SRC1[255:128],SRC2[255:128])
DEST[VLMAX-1:256] ← 0

VPCMPGTB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k2[j] OR *no writemask*
        THEN
            /* signed comparison */
            CMP ← SRC1[i+7:i] > SRC2[i+7:i];
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0


PCMPGTW (with 64-bit operands)
    IF DEST[15:0] > SRC[15:0]
        THEN DEST[15:0] ← FFFFH;
        ELSE DEST[15:0] ← 0; FI;
    (* Continue comparison of 2nd and 3rd words in DEST and SRC *)
    IF DEST[63:48] > SRC[63:48]
        THEN DEST[63:48] ← FFFFH;
        ELSE DEST[63:48] ← 0; FI;

PCMPGTW (with 128-bit operands)
DEST[127:0] ← COMPARE_WORDS_GREATER(DEST[127:0],SRC[127:0])
DEST[MAX_VL-1:128] (Unmodified)

VPCMPGTW (VEX.128 encoded version)
DEST[127:0] ← COMPARE_WORDS_GREATER(SRC1,SRC2)
DEST[VLMAX-1:128] ← 0

VPCMPGTW (VEX.256 encoded version)
DEST[127:0] ← COMPARE_WORDS_GREATER(SRC1[127:0],SRC2[127:0])
DEST[255:128] ← COMPARE_WORDS_GREATER(SRC1[255:128],SRC2[255:128])
DEST[VLMAX-1:256] ← 0

VPCMPGTW (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k2[j] OR *no writemask*
        THEN
            /* signed comparison */
            CMP ← SRC1[i+15:i] > SRC2[i+15:i];
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0


PCMPGTD (with 64-bit operands)
    IF DEST[31:0] > SRC[31:0]
        THEN DEST[31:0] ← FFFFFFFFH;
        ELSE DEST[31:0] ← 0; FI;
    IF DEST[63:32] > SRC[63:32]
        THEN DEST[63:32] ← FFFFFFFFH;
        ELSE DEST[63:32] ← 0; FI;

PCMPGTD (with 128-bit operands)
DEST[127:0] ← COMPARE_DWORDS_GREATER(DEST[127:0],SRC[127:0])
DEST[MAX_VL-1:128] (Unmodified)

VPCMPGTD (VEX.128 encoded version)
DEST[127:0] ← COMPARE_DWORDS_GREATER(SRC1,SRC2)
DEST[VLMAX-1:128] ← 0

VPCMPGTD (VEX.256 encoded version)
DEST[127:0] ← COMPARE_DWORDS_GREATER(SRC1[127:0],SRC2[127:0])
DEST[255:128] ← COMPARE_DWORDS_GREATER(SRC1[255:128],SRC2[255:128])
DEST[VLMAX-1:256] ← 0

VPCMPGTD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k2[j] OR *no writemask*
        THEN
            /* signed comparison */
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN CMP ← SRC1[i+31:i] > SRC2[31:0];
                ELSE CMP ← SRC1[i+31:i] > SRC2[i+31:i];
            FI;
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0
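
Each EVEX form writes one mask bit per element, so KL is the vector length divided by the element width (for example, 512 / 32 = 16 doubleword elements). A one-line helper makes the relationship explicit; the name `mask_bits_for` is hypothetical.

```c
/* Number of mask bits (KL) for a vector length and an element width,
   both given in bits. */
static int mask_bits_for(int vl_bits, int elem_bits)
{
    return vl_bits / elem_bits;
}
```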

Intel C/C++ Compiler Intrinsic Equivalents

VPCMPGTB __mmask64 _mm512_cmpgt_epi8_mask(__m512i a, __m512i b);
VPCMPGTB __mmask64 _mm512_mask_cmpgt_epi8_mask(__mmask64 k, __m512i a, __m512i b);
VPCMPGTB __mmask32 _mm256_cmpgt_epi8_mask(__m256i a, __m256i b);
VPCMPGTB __mmask32 _mm256_mask_cmpgt_epi8_mask(__mmask32 k, __m256i a, __m256i b);
VPCMPGTB __mmask16 _mm_cmpgt_epi8_mask(__m128i a, __m128i b);
VPCMPGTB __mmask16 _mm_mask_cmpgt_epi8_mask(__mmask16 k, __m128i a, __m128i b);
VPCMPGTD __mmask16 _mm512_cmpgt_epi32_mask(__m512i a, __m512i b);
VPCMPGTD __mmask16 _mm512_mask_cmpgt_epi32_mask(__mmask16 k, __m512i a, __m512i b);
VPCMPGTD __mmask8 _mm256_cmpgt_epi32_mask(__m256i a, __m256i b);
VPCMPGTD __mmask8 _mm256_mask_cmpgt_epi32_mask(__mmask8 k, __m256i a, __m256i b);
VPCMPGTD __mmask8 _mm_cmpgt_epi32_mask(__m128i a, __m128i b);
VPCMPGTD __mmask8 _mm_mask_cmpgt_epi32_mask(__mmask8 k, __m128i a, __m128i b);
VPCMPGTW __mmask32 _mm512_cmpgt_epi16_mask(__m512i a, __m512i b);
VPCMPGTW __mmask32 _mm512_mask_cmpgt_epi16_mask(__mmask32 k, __m512i a, __m512i b);
VPCMPGTW __mmask16 _mm256_cmpgt_epi16_mask(__m256i a, __m256i b);
VPCMPGTW __mmask16 _mm256_mask_cmpgt_epi16_mask(__mmask16 k, __m256i a, __m256i b);
VPCMPGTW __mmask8 _mm_cmpgt_epi16_mask(__m128i a, __m128i b);
VPCMPGTW __mmask8 _mm_mask_cmpgt_epi16_mask(__mmask8 k, __m128i a, __m128i b);
PCMPGTB: __m64 _mm_cmpgt_pi8 (__m64 m1, __m64 m2)
PCMPGTW: __m64 _mm_cmpgt_pi16 (__m64 m1, __m64 m2)
PCMPGTD: __m64 _mm_cmpgt_pi32 (__m64 m1, __m64 m2)
(V)PCMPGTB: __m128i _mm_cmpgt_epi8 (__m128i a, __m128i b)
(V)PCMPGTW: __m128i _mm_cmpgt_epi16 (__m128i a, __m128i b)
(V)PCMPGTD: __m128i _mm_cmpgt_epi32 (__m128i a, __m128i b)
VPCMPGTB: __m256i _mm256_cmpgt_epi8 (__m256i a, __m256i b)
VPCMPGTW: __m256i _mm256_cmpgt_epi16 (__m256i a, __m256i b)
VPCMPGTD: __m256i _mm256_cmpgt_epi32 (__m256i a, __m256i b)

Flags Affected 

None. 

Numeric Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded VPCMPGTD, see Exceptions Type E4. 

EVEX-encoded VPCMPGTB/W, see Exceptions Type E4.nb.

PCMPGTQ — Compare Packed Data for Greater Than

| Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description |
|---|---|---|---|---|---|
| 66 0F 38 37 /r | PCMPGTQ xmm1, xmm2/m128 | RM | V/V | SSE4_2 | Compare packed signed qwords in xmm2/m128 and xmm1 for greater than. |
| VEX.NDS.128.66.0F38.WIG 37 /r | VPCMPGTQ xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Compare packed signed qwords in xmm2 and xmm3/m128 for greater than. |
| VEX.NDS.256.66.0F38.WIG 37 /r | VPCMPGTQ ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Compare packed signed qwords in ymm2 and ymm3/m256 for greater than. |
| EVEX.NDS.128.66.0F38.W1 37 /r | VPCMPGTQ k1 {k2}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Compare Greater between int64 vector xmm2 and int64 vector xmm3/m128/m64bcst, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |
| EVEX.NDS.256.66.0F38.W1 37 /r | VPCMPGTQ k1 {k2}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Compare Greater between int64 vector ymm2 and int64 vector ymm3/m256/m64bcst, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |
| EVEX.NDS.512.66.0F38.W1 37 /r | VPCMPGTQ k1 {k2}, zmm2, zmm3/m512/m64bcst | FV | V/V | AVX512F | Compare Greater between int64 vector zmm2 and int64 vector zmm3/m512/m64bcst, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask. |


Instruction Operand Encoding 


| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA |
| RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA |
| FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA |


Description 

Performs an SIMD signed compare for the packed quadwords in the destination operand (first operand) and the
source operand (second operand). If the data element in the first (destination) operand is greater than the
corresponding element in the second (source) operand, the corresponding data element in the destination is set
to all 1s; otherwise, it is set to 0s.
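
A scalar model of the quadword case shows how signedness affects the result. This is an illustrative sketch; the function name `compare_qwords_greater_model` and its array interface are hypothetical.

```c
#include <stdint.h>

/* Scalar model of a 128-bit packed signed quadword greater-than compare.
   dest[lane] is all 1s when src1[lane] > src2[lane] as signed 64-bit
   integers, and 0 otherwise. */
static void compare_qwords_greater_model(const int64_t src1[2],
                                         const int64_t src2[2],
                                         uint64_t dest[2])
{
    for (int lane = 0; lane < 2; lane++) {
        dest[lane] = (src1[lane] > src2[lane]) ? UINT64_MAX : 0;
    }
}
```

In the usage below, 0 compares greater than -1 even though the bit pattern of -1 (all 1s) is the largest unsigned value.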

128-bit Legacy SSE version: The second source operand can be an XMM register or a 128-bit memory location. The
first source operand and destination operand are XMM registers. Bits (VLMAX-1:128) of the corresponding YMM
destination register remain unchanged.

VEX.128 encoded version: The second source operand can be an XMM register or a 128-bit memory location. The
first source operand and destination operand are XMM registers. Bits (VLMAX-1:128) of the corresponding YMM
register are zeroed.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register.

EVEX encoded VPCMPGTD/Q: The first source operand (second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand (first operand) is a mask register updated
according to the writemask k2.

Operation

COMPARE_QWORDS_GREATER (SRC1, SRC2)
    IF SRC1[63:0] > SRC2[63:0]
        THEN DEST[63:0] ← FFFFFFFFFFFFFFFFH;
        ELSE DEST[63:0] ← 0; FI;
    IF SRC1[127:64] > SRC2[127:64]
        THEN DEST[127:64] ← FFFFFFFFFFFFFFFFH;
        ELSE DEST[127:64] ← 0; FI;

VPCMPGTQ (VEX.128 encoded version)
DEST[127:0] ← COMPARE_QWORDS_GREATER(SRC1,SRC2)
DEST[VLMAX-1:128] ← 0

VPCMPGTQ (VEX.256 encoded version)
DEST[127:0] ← COMPARE_QWORDS_GREATER(SRC1[127:0],SRC2[127:0])
DEST[255:128] ← COMPARE_QWORDS_GREATER(SRC1[255:128],SRC2[255:128])
DEST[VLMAX-1:256] ← 0

VPCMPGTQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k2[j] OR *no writemask*
        THEN
            /* signed comparison */
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN CMP ← SRC1[i+63:i] > SRC2[63:0];
                ELSE CMP ← SRC1[i+63:i] > SRC2[i+63:i];
            FI;
            IF CMP = TRUE
                THEN DEST[j] ← 1;
                ELSE DEST[j] ← 0; FI;
        ELSE DEST[j] ← 0 ; zeroing-masking only
    FI;
ENDFOR
DEST[MAX_KL-1:KL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VPCMPGTQ __mmask8 _mm512_cmpgt_epi64_mask(__m512i a, __m512i b);
VPCMPGTQ __mmask8 _mm512_mask_cmpgt_epi64_mask(__mmask8 k, __m512i a, __m512i b);
VPCMPGTQ __mmask8 _mm256_cmpgt_epi64_mask(__m256i a, __m256i b);
VPCMPGTQ __mmask8 _mm256_mask_cmpgt_epi64_mask(__mmask8 k, __m256i a, __m256i b);
VPCMPGTQ __mmask8 _mm_cmpgt_epi64_mask(__m128i a, __m128i b);
VPCMPGTQ __mmask8 _mm_mask_cmpgt_epi64_mask(__mmask8 k, __m128i a, __m128i b);
(V)PCMPGTQ: __m128i _mm_cmpgt_epi64(__m128i a, __m128i b)
VPCMPGTQ: __m256i _mm256_cmpgt_epi64(__m256i a, __m256i b);

Flags Affected 

None. 

SIMD Floating-Point Exceptions 

None.

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.
EVEX-encoded VPCMPGTQ, see Exceptions Type E4.

PCMPISTRI — Packed Compare Implicit Length Strings, Return Index

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 3A 63 /r imm8 PCMPISTRI xmm1, xmm2/m128, imm8 | RM | V/V | SSE4_2 | Perform a packed comparison of string data with implicit lengths, generating an index, and storing the result in ECX.
VEX.128.66.0F3A.WIG 63 /r ib VPCMPISTRI xmm1, xmm2/m128, imm8 | RM | V/V | AVX | Perform a packed comparison of string data with implicit lengths, generating an index, and storing the result in ECX.

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r) | ModRM:r/m (r) | imm8 | NA

Description 

The instruction compares data from two strings based on the encoded value in the Imm8 Control Byte (see Section 
4.1, "Imm8 Control Byte Operation for PCMPESTRI / PCMPESTRM / PCMPISTRI / PCMPISTRM"), and generates an 
index stored to ECX. 

Each string is represented by a single value. The value is an xmm (or possibly m128 for the second operand) which 
contains the data elements of the string (byte or word data). Each input byte/word is augmented with a 
valid/invalid tag. A byte/word is considered valid only if it has a lower index than the least significant null 
byte/word. (The least significant null byte/word is also considered invalid.) 

The comparison and aggregation operations are performed according to the encoded value of Imm8 bit fields (see 
Section 4.1). The index of the first (or last, according to imm8[6]) set bit of IntRes2 is returned in ECX. If no bits 
are set in IntRes2, ECX is set to 16 for byte elements (8 for word elements). 

Note that the Arithmetic Flags are written in a non-standard manner in order to supply the most relevant 
information: 

CFlag - Reset if IntRes2 is equal to zero, set otherwise 
ZFlag - Set if any byte/word of xmm2/mem128 is null, reset otherwise 
SFlag - Set if any byte/word of xmm1 is null, reset otherwise 
OFlag - IntRes2[0] 
AFlag - Reset 
PFlag - Reset 

Note: In VEX.128 encoded version, VEX.vvvv is reserved and must be 1111b, VEX.L must be 0, otherwise the 
instruction will #UD. 
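
The final step, turning IntRes2 into the value written to ECX, can be sketched in C. This is an illustrative model only: the helper name `pcmpistri_ecx` is hypothetical, and the use of imm8[0] to distinguish word elements from byte elements follows the imm8 field layout of Section 4.1 rather than anything stated on this page.

```c
#include <stdint.h>

/* Models the value PCMPISTRI places in ECX for a given IntRes2.
 * imm8 bit 0: 0 = byte elements (16 of them), 1 = word elements (8).
 * imm8 bit 6: 0 = index of least significant set bit,
 *             1 = index of most significant set bit. */
static int pcmpistri_ecx(uint16_t intres2, unsigned imm8)
{
    int n = (imm8 & 0x01) ? 8 : 16;
    intres2 &= (uint16_t)((1u << n) - 1u);   /* keep only the n valid bits */

    if (intres2 == 0)
        return n;                            /* no bits set: 16 (8) */

    if (imm8 & 0x40) {
        for (int i = n - 1; i >= 0; i--)
            if ((intres2 >> i) & 1)
                return i;
    } else {
        for (int i = 0; i < n; i++)
            if ((intres2 >> i) & 1)
                return i;
    }
    return n;                                /* not reached */
}
```

In the same model, CFlag corresponds to `intres2 != 0` and OFlag to `intres2 & 1`.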


Effective Operand Size

Operating mode/size | Operand 1 | Operand 2 | Result
16 bit | xmm | xmm/m128 | ECX
32 bit | xmm | xmm/m128 | ECX
64 bit | xmm | xmm/m128 | ECX


Intel C/C++ Compiler Intrinsic Equivalent For Returning Index 

int _mm_cmpistri (__m128i a, __m128i b, const int mode);

Intel C/C++ Compiler Intrinsics For Reading EFlag Results

int _mm_cmpistra (__m128i a, __m128i b, const int mode);
int _mm_cmpistrc (__m128i a, __m128i b, const int mode);
int _mm_cmpistro (__m128i a, __m128i b, const int mode);
int _mm_cmpistrs (__m128i a, __m128i b, const int mode);
int _mm_cmpistrz (__m128i a, __m128i b, const int mode);

SIMD Floating-Point Exceptions

None.

Other Exceptions

See Exceptions Type 4; additionally, this instruction does not cause #GP if the memory operand is not aligned to a
16-byte boundary, and
#UD    If VEX.L = 1.

       If VEX.vvvv != 1111B.

PCMPISTRM — Packed Compare Implicit Length Strings, Return Mask

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 3A 62 /r imm8 PCMPISTRM xmm1, xmm2/m128, imm8 | RM | V/V | SSE4_2 | Perform a packed comparison of string data with implicit lengths, generating a mask, and storing the result in XMM0.
VEX.128.66.0F3A.WIG 62 /r ib VPCMPISTRM xmm1, xmm2/m128, imm8 | RM | V/V | AVX | Perform a packed comparison of string data with implicit lengths, generating a mask, and storing the result in XMM0.

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r) | ModRM:r/m (r) | imm8 | NA


Description 

The instruction compares data from two strings based on the encoded value in the imm8 byte (see Section 4.1, 
"Imm8 Control Byte Operation for PCMPESTRI / PCMPESTRM / PCMPISTRI / PCMPISTRM"), generating a mask 
stored to XMM0. 

Each string is represented by a single value. The value is an xmm (or possibly m128 for the second operand) which 
contains the data elements of the string (byte or word data). Each input byte/word is augmented with a 
valid/invalid tag. A byte/word is considered valid only if it has a lower index than the least significant null 
byte/word. (The least significant null byte/word is also considered invalid.) 

The comparison and aggregation operations are performed according to the encoded value of Imm8 bit fields (see 
Section 4.1). As defined by imm8[6], IntRes2 is then either stored to the least significant bits of XMM0 (zero 
extended to 128 bits) or expanded into a byte/word-mask and then stored to XMM0. 
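
The two output forms selected by imm8[6] can be sketched in C. This is a model under stated assumptions, not the instruction definition: the helper name `pcmpistrm_xmm0` is hypothetical, XMM0 is represented as 16 bytes with byte 0 holding bits 7:0, and imm8[0] is taken to select word elements as laid out in Section 4.1.

```c
#include <stdint.h>
#include <string.h>

/* Fills out[16] with the XMM0 image that PCMPISTRM derives from IntRes2.
 * imm8 bit 0: 0 = byte elements, 1 = word elements.
 * imm8 bit 6: 0 = bit mask (IntRes2 zero-extended to 128 bits),
 *             1 = byte/word mask (each set bit expands to all ones). */
static void pcmpistrm_xmm0(uint16_t intres2, unsigned imm8, uint8_t out[16])
{
    int words = (int)(imm8 & 0x01);
    int n = words ? 8 : 16;

    memset(out, 0, 16);

    if (!(imm8 & 0x40)) {
        out[0] = (uint8_t)(intres2 & 0xFF);
        if (!words)
            out[1] = (uint8_t)(intres2 >> 8);
        return;
    }

    for (int i = 0; i < n; i++) {
        if ((intres2 >> i) & 1) {
            if (words) {
                out[2 * i] = 0xFF;
                out[2 * i + 1] = 0xFF;
            } else {
                out[i] = 0xFF;
            }
        }
    }
}
```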

Note that the Arithmetic Flags are written in a non-standard manner in order to supply the most relevant 
information: 

CFlag - Reset if IntRes2 is equal to zero, set otherwise 
ZFlag - Set if any byte/word of xmm2/mem128 is null, reset otherwise 
SFlag - Set if any byte/word of xmm1 is null, reset otherwise 
OFlag - IntRes2[0] 
AFlag - Reset 
PFlag - Reset 

Note: In VEX.128 encoded versions, bits (VLMAX-1:128) of XMM0 are zeroed. VEX.vvvv is reserved and must be 
1111b, VEX.L must be 0, otherwise the instruction will #UD. 

Effective Operand Size

Operating mode/size | Operand 1 | Operand 2 | Result
16 bit | xmm | xmm/m128 | XMM0
32 bit | xmm | xmm/m128 | XMM0
64 bit | xmm | xmm/m128 | XMM0


Intel C/C++ Compiler Intrinsic Equivalent For Returning Mask 

__m128i _mm_cmpistrm (__m128i a, __m128i b, const int mode);

Intel C/C++ Compiler Intrinsics For Reading EFlag Results

int _mm_cmpistra (__m128i a, __m128i b, const int mode);
int _mm_cmpistrc (__m128i a, __m128i b, const int mode);
int _mm_cmpistro (__m128i a, __m128i b, const int mode);
int _mm_cmpistrs (__m128i a, __m128i b, const int mode);
int _mm_cmpistrz (__m128i a, __m128i b, const int mode);

SIMD Floating-Point Exceptions

None.

Other Exceptions

See Exceptions Type 4; additionally, this instruction does not cause #GP if the memory operand is not aligned to a
16-byte boundary, and
#UD    If VEX.L = 1.
       If VEX.vvvv != 1111B.

PDEP — Parallel Bits Deposit

Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
VEX.NDS.LZ.F2.0F38.W0 F5 /r PDEP r32a, r32b, r/m32 | RVM | V/V | BMI2 | Parallel deposit of bits from r32b using mask in r/m32, result is written to r32a.
VEX.NDS.LZ.F2.0F38.W1 F5 /r PDEP r64a, r64b, r/m64 | RVM | V/N.E. | BMI2 | Parallel deposit of bits from r64b using mask in r/m64, result is written to r64a.

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

PDEP uses a mask in the second source operand (the third operand) to transfer/scatter contiguous low order bits 
in the first source operand (the second operand) into the destination (the first operand). PDEP takes the low bits 
from the first source operand and deposits them in the destination operand at the corresponding bit locations that 
are set in the second source operand (mask). All other bits (bits not set in mask) in the destination are set to zero. 


Figure 4-8. PDEP Example (diagram rows: SRC1, SRC2 (mask), DEST; bit 31 through bit 0)

This instruction is not supported in real mode and virtual-8086 mode. The operand size is always 32 bits if not in 
64-bit mode. In 64-bit mode operand size 64 requires VEX.W1. VEX.W1 is ignored in non-64-bit modes. An 
attempt to execute this instruction with VEX.L not equal to 0 will cause #UD. 

Operation

TEMP ← SRC1;
MASK ← SRC2;
DEST ← 0;
m ← 0, k ← 0;
DO WHILE m < OperandSize
    IF MASK[m] = 1 THEN
        DEST[m] ← TEMP[k];
        k ← k + 1;
    FI
    m ← m + 1;
OD

Flags Affected

None.

Intel C/C++ Compiler Intrinsic Equivalent 

PDEP: unsigned __int32 _pdep_u32(unsigned __int32 src, unsigned __int32 mask);

PDEP: unsigned __int64 _pdep_u64(unsigned __int64 src, unsigned __int64 mask);
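
Where BMI2 is not available, the Operation loop for PDEP can be followed literally in portable C. The sketch below is an illustration, not a replacement for the intrinsics: the name `pdep_soft` and the explicit `operand_size` parameter (32 or 64) are assumptions.

```c
#include <stdint.h>

/* Bit-by-bit software model of PDEP.
 * Consecutive low-order bits of src are deposited at the positions
 * where mask has a 1; all other destination bits are 0. */
static uint64_t pdep_soft(uint64_t src, uint64_t mask, int operand_size)
{
    uint64_t dest = 0;
    int k = 0;                              /* next source bit to consume */

    for (int m = 0; m < operand_size; m++) {
        if ((mask >> m) & 1) {
            dest |= ((src >> k) & 1) << m;  /* DEST[m] <- TEMP[k] */
            k++;
        }
    }
    return dest;
}
```

For example, depositing 0BH (1011b) through the mask D4H (bits 2, 4, 6, 7 set) yields 94H.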

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

See Section 2.5.1, "Exception Conditions for VEX-Encoded GPR Instructions", Table 2-29; additionally
#UD    If VEX.W = 1.

PEXT — Parallel Bits Extract


Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
VEX.NDS.LZ.F3.0F38.W0 F5 /r PEXT r32a, r32b, r/m32 | RVM | V/V | BMI2 | Parallel extract of bits from r32b using mask in r/m32, result is written to r32a.
VEX.NDS.LZ.F3.0F38.W1 F5 /r PEXT r64a, r64b, r/m64 | RVM | V/N.E. | BMI2 | Parallel extract of bits from r64b using mask in r/m64, result is written to r64a.

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

PEXT uses a mask in the second source operand (the third operand) to transfer either contiguous or non-contiguous 
bits in the first source operand (the second operand) to contiguous low order bit positions in the destination 
(the first operand). For each bit set in the MASK, PEXT extracts the corresponding bits from the first source operand 
and writes them into contiguous lower bits of the destination operand. The remaining upper bits of the destination are 
zeroed. 



Figure 4-9. PEXT Example

This instruction is not supported in real mode and virtual-8086 mode. The operand size is always 32 bits if not in 
64-bit mode. In 64-bit mode operand size 64 requires VEX.W1. VEX.W1 is ignored in non-64-bit modes. An 
attempt to execute this instruction with VEX.L not equal to 0 will cause #UD. 

Operation

TEMP ← SRC1;
MASK ← SRC2;
DEST ← 0;
m ← 0, k ← 0;
DO WHILE m < OperandSize
    IF MASK[m] = 1 THEN
        DEST[k] ← TEMP[m];
        k ← k + 1;
    FI
    m ← m + 1;
OD
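
The loop above translates directly into portable C. As with any software model, this is a sketch under assumptions: the name `pext_soft` and the explicit `operand_size` parameter (32 or 64) are illustrative and not part of the instruction definition.

```c
#include <stdint.h>

/* Bit-by-bit software model of PEXT.
 * Bits of src at positions where mask has a 1 are gathered into
 * contiguous low-order bits of the result; upper bits are 0. */
static uint64_t pext_soft(uint64_t src, uint64_t mask, int operand_size)
{
    uint64_t dest = 0;
    int k = 0;                              /* next destination bit */

    for (int m = 0; m < operand_size; m++) {
        if ((mask >> m) & 1) {
            dest |= ((src >> m) & 1) << k;  /* DEST[k] <- TEMP[m] */
            k++;
        }
    }
    return dest;
}
```

Extracting from 94H through the mask D4H gives 0BH, the inverse of the corresponding PDEP example with the same mask.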

Flags Affected 

None. 

Intel C/C++ Compiler Intrinsic Equivalent 

PEXT: unsigned __int32 _pext_u32(unsigned __int32 src, unsigned __int32 mask);

PEXT: unsigned __int64 _pext_u64(unsigned __int64 src, unsigned __int64 mask);

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

See Section 2.5.1, "Exception Conditions for VEX-Encoded GPR Instructions", Table 2-29; additionally
#UD    If VEX.W = 1.

PEXTRB/PEXTRD/PEXTRQ - Extract Byte/Dword/Qword


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 3A 14 /r ib PEXTRB reg/m8, xmm2, imm8 | MRI | V/V | SSE4_1 | Extract a byte integer value from xmm2 at the source byte offset specified by imm8 into reg or m8. The upper bits of r32 or r64 are zeroed.
66 0F 3A 16 /r ib PEXTRD r/m32, xmm2, imm8 | MRI | V/V | SSE4_1 | Extract a dword integer value from xmm2 at the source dword offset specified by imm8 into r/m32.
66 REX.W 0F 3A 16 /r ib PEXTRQ r/m64, xmm2, imm8 | MRI | V/N.E. | SSE4_1 | Extract a qword integer value from xmm2 at the source qword offset specified by imm8 into r/m64.
VEX.128.66.0F3A.W0 14 /r ib VPEXTRB reg/m8, xmm2, imm8 | MRI | V/V (see note 1) | AVX | Extract a byte integer value from xmm2 at the source byte offset specified by imm8 into reg or m8. The upper bits of r64/r32 are filled with zeros.
VEX.128.66.0F3A.W0 16 /r ib VPEXTRD r32/m32, xmm2, imm8 | MRI | V/V | AVX | Extract a dword integer value from xmm2 at the source dword offset specified by imm8 into r32/m32.
VEX.128.66.0F3A.W1 16 /r ib VPEXTRQ r64/m64, xmm2, imm8 | MRI | V/I | AVX | Extract a qword integer value from xmm2 at the source qword offset specified by imm8 into r64/m64.
EVEX.128.66.0F3A.WIG 14 /r ib VPEXTRB reg/m8, xmm2, imm8 | T1S-MRI | V/V | AVX512BW | Extract a byte integer value from xmm2 at the source byte offset specified by imm8 into reg or m8. The upper bits of r64/r32 are filled with zeros.
EVEX.128.66.0F3A.W0 16 /r ib VPEXTRD r32/m32, xmm2, imm8 | T1S-MRI | V/V | AVX512DQ | Extract a dword integer value from xmm2 at the source dword offset specified by imm8 into r32/m32.
EVEX.128.66.0F3A.W1 16 /r ib VPEXTRQ r64/m64, xmm2, imm8 | T1S-MRI | V/N.E. (see note 2) | AVX512DQ | Extract a qword integer value from xmm2 at the source qword offset specified by imm8 into r64/m64.

NOTES:

1. In 64-bit mode, VEX.W1 is ignored for VPEXTRB (similar to legacy REX.W=1 prefix in PEXTRB).

2. VEX.W/EVEX.W in non-64 bit is ignored; the instructions behave as if the W0 version is used.

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
MRI | ModRM:r/m (w) | ModRM:reg (r) | imm8 | NA


Description 

Extract a byte/dword/qword integer value from the source XMM register at a byte/dword/qword offset determined 
from imm8[3:0]. The destination can be a register or byte/dword/qword memory location. If the destination is a 
register, the upper bits of the register are zero extended. 

In legacy non-VEX encoded version and if the destination operand is a register, the default operand size in 64-bit 
mode for PEXTRB/PEXTRD is 64 bits, the bits above the least significant byte/dword data are filled with zeros. 
PEXTRQ is not encodable in non-64-bit modes and requires REX.W in 64-bit mode. 

Note: In VEX.128 encoded versions, VEX.vvvv is reserved and must be 1111b, VEX.L must be 0, otherwise the 
instruction will #UD. In EVEX.128 encoded versions, EVEX.vvvv is reserved and must be 1111b, EVEX.L'L must be 
0, otherwise the instruction will #UD. If the destination operand is a register, the default operand size in 64-bit 
mode for VPEXTRB/VPEXTRD is 64 bits, the bits above the least significant byte/word/dword data are filled with 
zeros. Attempt to execute VPEXTRQ in non-64-bit mode will cause #UD. 

Operation

CASE of
    PEXTRB: SEL ← COUNT[3:0];
            TEMP ← (Src >> SEL*8) AND FFH;
            IF (DEST = Mem8)
                THEN
                    Mem8 ← TEMP[7:0];
            ELSE IF (64-Bit Mode and 64-bit register selected)
                THEN
                    R64[7:0] ← TEMP[7:0];
                    r64[63:8] ← ZERO_FILL;
            ELSE
                    R32[7:0] ← TEMP[7:0];
                    r32[31:8] ← ZERO_FILL;
            FI;
    PEXTRD: SEL ← COUNT[1:0];
            TEMP ← (Src >> SEL*32) AND FFFF_FFFFH;
            DEST ← TEMP;
    PEXTRQ: SEL ← COUNT[0];
            TEMP ← (Src >> SEL*64);
            DEST ← TEMP;
ESAC;

VPEXTRD/VPEXTRQ
IF (64-Bit Mode and 64-bit dest operand)
THEN
    Src_Offset ← Imm8[0]
    r64/m64 ← (Src >> Src_Offset * 64)
ELSE
    Src_Offset ← Imm8[1:0]
    r32/m32 ← ((Src >> Src_Offset * 32) AND 0FFFFFFFFh);
FI

VPEXTRB (dest=m8)
SRC_Offset ← Imm8[3:0]
Mem8 ← (Src >> Src_Offset*8)

VPEXTRB (dest=reg)
IF (64-Bit Mode)
THEN
    SRC_Offset ← Imm8[3:0]
    DEST[7:0] ← ((Src >> Src_Offset*8) AND 0FFh)
    DEST[63:8] ← ZERO_FILL;
ELSE
    SRC_Offset ← Imm8[3:0];
    DEST[7:0] ← ((Src >> Src_Offset*8) AND 0FFh);
    DEST[31:8] ← ZERO_FILL;
FI
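
The element selection in the pseudocode above amounts to masking imm8 and reading a little-endian field from the source register. A minimal C sketch follows; it assumes the XMM register is modeled as 16 bytes with byte 0 holding bits 7:0, and the helper names (`extract_le`, `pextrb`, `pextrd`, `pextrq`) are hypothetical.

```c
#include <stdint.h>

/* Reads nbytes (1, 4, or 8) starting at byte_off from a 16-byte
 * little-endian register image and zero-extends to 64 bits. */
static uint64_t extract_le(const uint8_t xmm[16], unsigned byte_off,
                           unsigned nbytes)
{
    uint64_t v = 0;
    for (unsigned i = 0; i < nbytes; i++)
        v |= (uint64_t)xmm[byte_off + i] << (8 * i);
    return v;
}

/* PEXTRB: SEL = imm8[3:0], byte granularity. */
static uint64_t pextrb(const uint8_t xmm[16], unsigned imm8)
{
    return extract_le(xmm, imm8 & 0xF, 1);
}

/* PEXTRD: SEL = imm8[1:0], dword granularity. */
static uint64_t pextrd(const uint8_t xmm[16], unsigned imm8)
{
    return extract_le(xmm, (imm8 & 0x3) * 4, 4);
}

/* PEXTRQ: SEL = imm8[0], qword granularity. */
static uint64_t pextrq(const uint8_t xmm[16], unsigned imm8)
{
    return extract_le(xmm, (imm8 & 0x1) * 8, 8);
}
```

Returning a zero-extended 64-bit value mirrors the register-destination case, where the bits above the extracted element are filled with zeros.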


Intel C/C++ Compiler Intrinsic Equivalent

PEXTRB: int _mm_extract_epi8 (__m128i src, const int ndx);

PEXTRD: int _mm_extract_epi32 (__m128i src, const int ndx);

PEXTRQ: __int64 _mm_extract_epi64 (__m128i src, const int ndx);

Flags Affected 

None. 

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 5; 
EVEX-encoded instruction, see Exceptions Type E9NF. 

#UD    If VEX.L = 1 or EVEX.L'L > 0.
       If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
       If VPEXTRQ in non-64-bit mode, VEX.W=1.

PEXTRW - Extract Word


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F C5 /r ib (see note 1) PEXTRW reg, mm, imm8 | RMI | V/V | SSE | Extract the word specified by imm8 from mm and move it to reg, bits 15-0. The upper bits of r32 or r64 are zeroed.
66 0F C5 /r ib PEXTRW reg, xmm, imm8 | RMI | V/V | SSE2 | Extract the word specified by imm8 from xmm and move it to reg, bits 15-0. The upper bits of r32 or r64 are zeroed.
66 0F 3A 15 /r ib PEXTRW reg/m16, xmm, imm8 | MRI | V/V | SSE4_1 | Extract the word specified by imm8 from xmm and copy it to lowest 16 bits of reg or m16. Zero-extend the result in the destination, r32 or r64.
VEX.128.66.0F.W0 C5 /r ib VPEXTRW reg, xmm1, imm8 | RMI | V/V (see note 2) | AVX | Extract the word specified by imm8 from xmm1 and move it to reg, bits 15:0. Zero-extend the result. The upper bits of r64/r32 are filled with zeros.
VEX.128.66.0F3A.W0 15 /r ib VPEXTRW reg/m16, xmm2, imm8 | MRI | V/V | AVX | Extract a word integer value from xmm2 at the source word offset specified by imm8 into reg or m16. The upper bits of r64/r32 are filled with zeros.
EVEX.128.66.0F.WIG C5 /r ib VPEXTRW reg, xmm1, imm8 | RMI | V/V | AVX512BW | Extract the word specified by imm8 from xmm1 and move it to reg, bits 15:0. Zero-extend the result. The upper bits of r64/r32 are filled with zeros.
EVEX.128.66.0F3A.WIG 15 /r ib VPEXTRW reg/m16, xmm2, imm8 | T1S-MRI | V/V | AVX512BW | Extract a word integer value from xmm2 at the source word offset specified by imm8 into reg or m16. The upper bits of r64/r32 are filled with zeros.

NOTES:

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel 64 and IA-32 Architectures Software 
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers" 
in the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. 

2. In 64-bit mode, VEX.W1 is ignored for VPEXTRW (similar to legacy REX.W=1 prefix in PEXTRW).

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RMI | ModRM:reg (w) | ModRM:r/m (r) | imm8 | NA
MRI | ModRM:r/m (w) | ModRM:reg (r) | imm8 | NA


Description 

Copies the word in the source operand (second operand) specified by the count operand (third operand) to the 
destination operand (first operand). The source operand can be an MMX technology register or an XMM register. 
The destination operand can be the low word of a general-purpose register or a 16-bit memory address. The count 
operand is an 8-bit immediate. When specifying a word location in an MMX technology register, the 2 least-significant 
bits of the count operand specify the location; for an XMM register, the 3 least-significant bits specify the location. 
The content of the destination register above bit 16 is cleared (set to all 0s). 

In 64-bit mode, using a REX prefix in the form of REX.R permits this instruction to access additional registers 
(XMM8-XMM15, R8-15). If the destination operand is a general-purpose register, the default operand size is 64-bits 
in 64-bit mode. 

Note: In VEX.128 encoded versions, VEX.vvvv is reserved and must be 1111b, VEX.L must be 0, otherwise the 
instruction will #UD. In EVEX.128 encoded versions, EVEX.vvvv is reserved and must be 1111b, EVEX.L must be 0, 
otherwise the instruction will #UD. If the destination operand is a register, the default operand size in 64-bit mode 
for VPEXTRW is 64 bits, the bits above the least significant byte/word/dword data are filled with zeros. 

Operation

IF (DEST = Mem16)
THEN
    SEL ← COUNT[2:0];
    TEMP ← (Src >> SEL*16) AND FFFFH;
    Mem16 ← TEMP[15:0];
ELSE IF (64-Bit Mode and destination is a general-purpose register)
    THEN
        FOR (PEXTRW instruction with 64-bit source operand)
            { SEL ← COUNT[1:0];
              TEMP ← (SRC >> (SEL * 16)) AND FFFFH;
              r64[15:0] ← TEMP[15:0];
              r64[63:16] ← ZERO_FILL; };
        FOR (PEXTRW instruction with 128-bit source operand)
            { SEL ← COUNT[2:0];
              TEMP ← (SRC >> (SEL * 16)) AND FFFFH;
              r64[15:0] ← TEMP[15:0];
              r64[63:16] ← ZERO_FILL; }
    ELSE
        FOR (PEXTRW instruction with 64-bit source operand)
            { SEL ← COUNT[1:0];
              TEMP ← (SRC >> (SEL * 16)) AND FFFFH;
              r32[15:0] ← TEMP[15:0];
              r32[31:16] ← ZERO_FILL; };
        FOR (PEXTRW instruction with 128-bit source operand)
            { SEL ← COUNT[2:0];
              TEMP ← (SRC >> (SEL * 16)) AND FFFFH;
              r32[15:0] ← TEMP[15:0];
              r32[31:16] ← ZERO_FILL; };
    FI;
FI;

VPEXTRW (dest=m16)
SRC_Offset ← Imm8[2:0]
Mem16 ← (Src >> Src_Offset*16)

VPEXTRW (dest=reg)
IF (64-Bit Mode)
THEN
    SRC_Offset ← Imm8[2:0]
    DEST[15:0] ← ((Src >> Src_Offset*16) AND 0FFFFh)
    DEST[63:16] ← ZERO_FILL;
ELSE
    SRC_Offset ← Imm8[2:0]
    DEST[15:0] ← ((Src >> Src_Offset*16) AND 0FFFFh)
    DEST[31:16] ← ZERO_FILL;
FI

Intel C/C++ Compiler Intrinsic Equivalent

PEXTRW: int _mm_extract_pi16 (__m64 a, int n)

PEXTRW: int _mm_extract_epi16 (__m128i a, int imm)
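
The word selection in the Operation section differs only in how many bits of the count operand are used: two for a 64-bit MMX source, three for a 128-bit XMM source. A small C sketch of both cases follows; the byte-array register model (byte 0 holds bits 7:0) and the function names `pextrw_mm` and `pextrw_xmm` are assumptions made for illustration.

```c
#include <stdint.h>

/* PEXTRW with a 64-bit (MMX) source: SEL = imm8[1:0]. */
static uint32_t pextrw_mm(const uint8_t mm[8], unsigned imm8)
{
    unsigned sel = imm8 & 0x3;
    return (uint32_t)mm[2 * sel] | ((uint32_t)mm[2 * sel + 1] << 8);
}

/* PEXTRW with a 128-bit (XMM) source: SEL = imm8[2:0]. */
static uint32_t pextrw_xmm(const uint8_t xmm[16], unsigned imm8)
{
    unsigned sel = imm8 & 0x7;
    return (uint32_t)xmm[2 * sel] | ((uint32_t)xmm[2 * sel + 1] << 8);
}
```

Both helpers return the word zero-extended, matching the ZERO_FILL of the upper destination bits.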

Flags Affected 

None. 

Numeric Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 5; 
EVEX-encoded instruction, see Exceptions Type E9NF. 

#UD    If VEX.L = 1 or EVEX.L'L > 0.
       If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.

PHADDW/PHADDD - Packed Horizontal Add


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F 38 01 /r (see note 1) PHADDW mm1, mm2/m64 | RM | V/V | SSSE3 | Add 16-bit integers horizontally, pack to mm1.
66 0F 38 01 /r PHADDW xmm1, xmm2/m128 | RM | V/V | SSSE3 | Add 16-bit integers horizontally, pack to xmm1.
0F 38 02 /r PHADDD mm1, mm2/m64 | RM | V/V | SSSE3 | Add 32-bit integers horizontally, pack to mm1.
66 0F 38 02 /r PHADDD xmm1, xmm2/m128 | RM | V/V | SSSE3 | Add 32-bit integers horizontally, pack to xmm1.
VEX.NDS.128.66.0F38.WIG 01 /r VPHADDW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Add 16-bit integers horizontally, pack to xmm1.
VEX.NDS.128.66.0F38.WIG 02 /r VPHADDD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Add 32-bit integers horizontally, pack to xmm1.
VEX.NDS.256.66.0F38.WIG 01 /r VPHADDW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Add 16-bit signed integers horizontally, pack to ymm1.
VEX.NDS.256.66.0F38.WIG 02 /r VPHADDD ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Add 32-bit signed integers horizontally, pack to ymm1.

NOTES:

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel 64 and IA-32 Architectures Software 
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers" 
in the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. 

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

(V)PHADDW adds two adjacent 16-bit signed integers horizontally from the source and destination operands and 
packs the 16-bit signed results to the destination operand (first operand). (V)PHADDD adds two adjacent 32-bit 
signed integers horizontally from the source and destination operands and packs the 32-bit signed results to the 
destination operand (first operand). When the source operand is a 128-bit memory operand, the operand must be 
aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated. 

Note that these instructions can operate on either unsigned or signed (two's complement notation) integers; 
however, they do not set bits in the EFLAGS register to indicate overflow and/or a carry. To prevent undetected 
overflow conditions, software must control the ranges of the values operated on. 

Legacy SSE instructions: Both operands can be MMX registers. The second source operand can be an MMX register 
or a 64-bit memory location. 

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source 
operand can be an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the corresponding YMM 
destination register remain unchanged. 

In 64-bit mode, use the REX prefix to access additional registers. 

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source 
operand can be an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the corresponding YMM 
register are zeroed. 

VEX.256 encoded version: Horizontal addition of two adjacent data elements of the low 16-bytes of the first and 
second source operands are packed into the low 16-bytes of the destination operand. Horizontal addition of two 
adjacent data elements of the high 16-bytes of the first and second source operands are packed into the high 
16-bytes of the destination operand. The first source and destination operands are YMM registers. The second source 
operand can be a YMM register or a 256-bit memory location. 

Note: VEX.L must be 0, otherwise the instruction will #UD. 



Figure 4-10. 256-bit VPHADDD Instruction Operation 


Operation

PHADDW (with 64-bit operands)

mm1[15-0] = mm1[31-16] + mm1[15-0];
mm1[31-16] = mm1[63-48] + mm1[47-32];
mm1[47-32] = mm2/m64[31-16] + mm2/m64[15-0];
mm1[63-48] = mm2/m64[63-48] + mm2/m64[47-32];

PHADDW (with 128-bit operands)

xmm1[15-0] = xmm1[31-16] + xmm1[15-0];
xmm1[31-16] = xmm1[63-48] + xmm1[47-32];
xmm1[47-32] = xmm1[95-80] + xmm1[79-64];
xmm1[63-48] = xmm1[127-112] + xmm1[111-96];
xmm1[79-64] = xmm2/m128[31-16] + xmm2/m128[15-0];
xmm1[95-80] = xmm2/m128[63-48] + xmm2/m128[47-32];
xmm1[111-96] = xmm2/m128[95-80] + xmm2/m128[79-64];
xmm1[127-112] = xmm2/m128[127-112] + xmm2/m128[111-96];

VPHADDW (VEX.128 encoded version)

DEST[15:0] ← SRC1[31:16] + SRC1[15:0]
DEST[31:16] ← SRC1[63:48] + SRC1[47:32]
DEST[47:32] ← SRC1[95:80] + SRC1[79:64]
DEST[63:48] ← SRC1[127:112] + SRC1[111:96]
DEST[79:64] ← SRC2[31:16] + SRC2[15:0]
DEST[95:80] ← SRC2[63:48] + SRC2[47:32]
DEST[111:96] ← SRC2[95:80] + SRC2[79:64]
DEST[127:112] ← SRC2[127:112] + SRC2[111:96]
DEST[VLMAX-1:128] ← 0

VPHADDW (VEX.256 encoded version)

DEST[15:0] ← SRC1[31:16] + SRC1[15:0]
DEST[31:16] ← SRC1[63:48] + SRC1[47:32]
DEST[47:32] ← SRC1[95:80] + SRC1[79:64]
DEST[63:48] ← SRC1[127:112] + SRC1[111:96]
DEST[79:64] ← SRC2[31:16] + SRC2[15:0]
DEST[95:80] ← SRC2[63:48] + SRC2[47:32]
DEST[111:96] ← SRC2[95:80] + SRC2[79:64]
DEST[127:112] ← SRC2[127:112] + SRC2[111:96]
DEST[143:128] ← SRC1[159:144] + SRC1[143:128]
DEST[159:144] ← SRC1[191:176] + SRC1[175:160]
DEST[175:160] ← SRC1[223:208] + SRC1[207:192]
DEST[191:176] ← SRC1[255:240] + SRC1[239:224]
DEST[207:192] ← SRC2[159:144] + SRC2[143:128]
DEST[223:208] ← SRC2[191:176] + SRC2[175:160]
DEST[239:224] ← SRC2[223:208] + SRC2[207:192]
DEST[255:240] ← SRC2[255:240] + SRC2[239:224]

PHADDD (with 64-bit operands)

mm1[31-0] = mm1[63-32] + mm1[31-0];
mm1[63-32] = mm2/m64[63-32] + mm2/m64[31-0];

PHADDD (with 128-bit operands)

xmm1[31-0] = xmm1[63-32] + xmm1[31-0];
xmm1[63-32] = xmm1[127-96] + xmm1[95-64];
xmm1[95-64] = xmm2/m128[63-32] + xmm2/m128[31-0];
xmm1[127-96] = xmm2/m128[127-96] + xmm2/m128[95-64];

VPHADDD (VEX.128 encoded version)

DEST[31-0] ← SRC1[63-32] + SRC1[31-0]
DEST[63-32] ← SRC1[127-96] + SRC1[95-64]
DEST[95-64] ← SRC2[63-32] + SRC2[31-0]
DEST[127-96] ← SRC2[127-96] + SRC2[95-64]
DEST[VLMAX-1:128] ← 0

VPHADDD (VEX.256 encoded version)

DEST[31-0] ← SRC1[63-32] + SRC1[31-0]
DEST[63-32] ← SRC1[127-96] + SRC1[95-64]
DEST[95-64] ← SRC2[63-32] + SRC2[31-0]
DEST[127-96] ← SRC2[127-96] + SRC2[95-64]
DEST[159-128] ← SRC1[191-160] + SRC1[159-128]
DEST[191-160] ← SRC1[255-224] + SRC1[223-192]
DEST[223-192] ← SRC2[191-160] + SRC2[159-128]
DEST[255-224] ← SRC2[255-224] + SRC2[223-192]
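
The 128-bit word form listed above reduces to a short loop: the low four result words are pairwise sums from the first source and the high four are pairwise sums from the second. The C sketch below models this with wrap-around arithmetic; the function name `phaddw128` and the array representation (element 0 = bits 15:0) are assumptions, and unsigned 16-bit elements are used so that the modulo 2^16 behavior is well defined in C.

```c
#include <stdint.h>
#include <string.h>

/* Software model of 128-bit PHADDW / VPHADDW (VEX.128).
 * a is the first source, b the second. dst may alias a, as in the
 * legacy form where the destination is also the first source.
 * Sums wrap modulo 2^16; no saturation and no flags. */
static void phaddw128(uint16_t dst[8], const uint16_t a[8],
                      const uint16_t b[8])
{
    uint16_t r[8];
    for (int i = 0; i < 4; i++) {
        r[i]     = (uint16_t)(a[2 * i] + a[2 * i + 1]);
        r[i + 4] = (uint16_t)(b[2 * i] + b[2 * i + 1]);
    }
    memcpy(dst, r, sizeof r);   /* write back after all reads */
}
```

The dword form follows the same pattern with two pairs per source, and the 256-bit forms apply it independently to each 128-bit half.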


Intel C/C++ Compiler Intrinsic Equivalents

PHADDW:     __m64 _mm_hadd_pi16 (__m64 a, __m64 b)
PHADDD:     __m64 _mm_hadd_pi32 (__m64 a, __m64 b)
(V)PHADDW:  __m128i _mm_hadd_epi16 (__m128i a, __m128i b)
(V)PHADDD:  __m128i _mm_hadd_epi32 (__m128i a, __m128i b)
VPHADDW:    __m256i _mm256_hadd_epi16 (__m256i a, __m256i b)
VPHADDD:    __m256i _mm256_hadd_epi32 (__m256i a, __m256i b)

SIMD Floating-Point Exceptions

None.

Other Exceptions 

See Exceptions Type 4; additionally
#UD    If VEX.L = 1.

PHADDSW — Packed Horizontal Add and Saturate

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F 38 03 /r (see note 1) PHADDSW mm1, mm2/m64 | RM | V/V | SSSE3 | Add 16-bit signed integers horizontally, pack saturated integers to mm1.
66 0F 38 03 /r PHADDSW xmm1, xmm2/m128 | RM | V/V | SSSE3 | Add 16-bit signed integers horizontally, pack saturated integers to xmm1.
VEX.NDS.128.66.0F38.WIG 03 /r VPHADDSW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Add 16-bit signed integers horizontally, pack saturated integers to xmm1.
VEX.NDS.256.66.0F38.WIG 03 /r VPHADDSW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Add 16-bit signed integers horizontally, pack saturated integers to ymm1.

NOTES:

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel 64 and IA-32 Architectures Software 
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers" 
in the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. 

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

(V)PHADDSW adds two adjacent signed 16-bit integers horizontally from the source and destination operands and 
saturates the signed results; packs the signed, saturated 16-bit results to the destination operand (first operand). 
When the source operand is a 128-bit memory operand, the operand must be aligned on a 16-byte boundary or a 
general-protection exception (#GP) will be generated. 

Legacy SSE version: Both operands can be MMX registers. The second source operand can be an MMX register or a 
64-bit memory location. 

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source 
operand is an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the corresponding YMM 
destination register remain unchanged. 

In 64-bit mode, use the REX prefix to access additional registers. 

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source 
operand is an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the destination YMM register are 
zeroed. 

VEX.256 encoded version: The first source and destination operands are YMM registers. The second source 
operand can be a YMM register or a 256-bit memory location. 

Note: VEX.L must be 0, otherwise the instruction will #UD. 
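
The saturating behavior described above can be illustrated with a short C model of the 128-bit form. This is a sketch under assumptions: `saturate_to_signed_word` mirrors the SaturateToSignedWord operator used in the Operation section, while `phaddsw128` and the array representation (element 0 = bits 15:0) are hypothetical names chosen for the example.

```c
#include <stdint.h>
#include <string.h>

/* Clamps a 32-bit intermediate sum to the signed 16-bit range. */
static int16_t saturate_to_signed_word(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;   /*  32767 (7FFFH) */
    if (x < INT16_MIN) return INT16_MIN;   /* -32768 (8000H) */
    return (int16_t)x;
}

/* Software model of 128-bit PHADDSW / VPHADDSW (VEX.128).
 * a is the first source, b the second; dst may alias a. */
static void phaddsw128(int16_t dst[8], const int16_t a[8],
                       const int16_t b[8])
{
    int16_t r[8];
    for (int i = 0; i < 4; i++) {
        r[i]     = saturate_to_signed_word((int32_t)a[2 * i] + a[2 * i + 1]);
        r[i + 4] = saturate_to_signed_word((int32_t)b[2 * i] + b[2 * i + 1]);
    }
    memcpy(dst, r, sizeof r);   /* write back after all reads */
}
```

Computing each sum in 32 bits before clamping is what distinguishes this from the wrap-around behavior of PHADDW.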

Operation 

PHADDSW (with 64-bit operands) 

mm1[15-0] = SaturateToSignedWord(mm1[31-16] + mm1[15-0]);
mm1[31-16] = SaturateToSignedWord(mm1[63-48] + mm1[47-32]);
mm1[47-32] = SaturateToSignedWord(mm2/m64[31-16] + mm2/m64[15-0]);
mm1[63-48] = SaturateToSignedWord(mm2/m64[63-48] + mm2/m64[47-32]);

PHADDSW (with 128-bit operands)

xmm1[15-0] = SaturateToSignedWord(xmm1[31-16] + xmm1[15-0]);
xmm1[31-16] = SaturateToSignedWord(xmm1[63-48] + xmm1[47-32]);
xmm1[47-32] = SaturateToSignedWord(xmm1[95-80] + xmm1[79-64]);
xmm1[63-48] = SaturateToSignedWord(xmm1[127-112] + xmm1[111-96]);
xmm1[79-64] = SaturateToSignedWord(xmm2/m128[31-16] + xmm2/m128[15-0]);
xmm1[95-80] = SaturateToSignedWord(xmm2/m128[63-48] + xmm2/m128[47-32]);
xmm1[111-96] = SaturateToSignedWord(xmm2/m128[95-80] + xmm2/m128[79-64]);
xmm1[127-112] = SaturateToSignedWord(xmm2/m128[127-112] + xmm2/m128[111-96]);

VPHADDSW (VEX.128 encoded version) 

DEST[15:0]= SaturateToSignedWord(SRC1 [31:16] + SRC1 [15:0]) 

DEST[31:16] = SaturateToSignedWord(SRC1 [63:48] + SRC1 [47:32]) 

DEST[47:32] = SaturateToSignedWord(SRC1 [95:80] + SRC1 [79:64]) 

DEST[63:48] = SaturateToSignedWord(SRC1 [127:112] + SRC1 [111:96]) 

DEST[79:64] = SaturateToSignedWord(SRC2[31:16] + SRC2[15:0]) 

DEST[95:80] = SaturateToSignedWord(SRC2[63:48] + SRC2[47:32]) 

DEST[111:96] = SaturateToSignedWord(SRC2[95:80] + SRC2[79:64]) 

DEST[127:112] = SaturateToSignedWord(SRC2[127:112] + SRC2[111:96]) 
DEST[VLMAX-1:128] ← 0

VPHADDSW (VEX.256 encoded version)

DEST[15:0]= SaturateToSignedWord(SRC1 [31:16] + SRC1 [15:0]) 

DEST[31:16] = SaturateToSignedWord(SRC1 [63:48] + SRC1 [47:32]) 

DEST[47:32] = SaturateToSignedWord(SRC1 [95:80] + SRC1 [79:64]) 

DEST[63:48] = SaturateToSignedWord(SRC1 [127:112] + SRC1 [111:96]) 

DEST[79:64] = SaturateToSignedWord(SRC2[31:16] + SRC2[15:0]) 

DEST[95:80] = SaturateToSignedWord(SRC2[63:48] + SRC2[47:32]) 

DEST[111:96] = SaturateToSignedWord(SRC2[95:80] + SRC2[79:64]) 

DEST[127:112] = SaturateToSignedWord(SRC2[127:112] + SRC2[111:96]) 

DEST[143:128]= SaturateToSignedWord(SRC1 [159:144] + SRC1 [143:128]) 

DEST[159:144] = SaturateToSignedWord(SRC1 [191:176] + SRC1 [175:160]) 

DEST[175:160] = SaturateToSignedWord( SRC1 [223:208] + SRC1 [207:192]) 

DEST[191:176] = SaturateToSignedWord(SRC1 [255:240] + SRC1 [239:224]) 

DEST[207:192] = SaturateToSignedWord(SRC2[159:144] + SRC2[143:128])

DEST[223:208] = SaturateToSignedWord(SRC2[191:176] + SRC2[175:160])

DEST[239:224] = SaturateToSignedWord(SRC2[223:208] + SRC2[207:192])

DEST[255:240] = SaturateToSignedWord(SRC2[255:240] + SRC2[239:224]) 
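
The 128-bit pseudocode above can be mirrored by a small scalar model. The following is only an illustrative sketch in portable C, not the hardware definition: the names saturate_to_signed_word and phaddsw_128 are hypothetical, and each register is assumed to be represented as an array of eight words with index 0 holding bits 15:0.

```c
#include <stdint.h>

/* Clamp a 32-bit intermediate sum to the signed 16-bit range. */
static int16_t saturate_to_signed_word(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Hypothetical scalar model of 128-bit PHADDSW.
   dst plays the role of xmm1, src the role of xmm2/m128. */
static void phaddsw_128(int16_t dst[8], const int16_t src[8]) {
    int16_t out[8];
    for (int i = 0; i < 4; i++) {
        /* Low four results come from adjacent pairs of the destination. */
        out[i] = saturate_to_signed_word((int32_t)dst[2 * i + 1] + dst[2 * i]);
        /* High four results come from adjacent pairs of the source. */
        out[i + 4] = saturate_to_signed_word((int32_t)src[2 * i + 1] + src[2 * i]);
    }
    for (int i = 0; i < 8; i++) dst[i] = out[i];
}
```

The sums are formed in 32-bit arithmetic so that overflow is visible before clamping; this is a modeling choice, not a statement about how hardware computes the result.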

Intel C/C++ Compiler Intrinsic Equivalent 

PHADDSW: __m64 _mm_hadds_pi16 (__m64 a, __m64 b)

(V)PHADDSW: __m128i _mm_hadds_epi16 (__m128i a, __m128i b)

VPHADDSW: __m256i _mm256_hadds_epi16 (__m256i a, __m256i b)

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

See Exceptions Type 4; additionally 
#UD If VEX.L = 1.

PHMINPOSUW — Packed Horizontal Word Minimum


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 38 41 /r PHMINPOSUW xmm1, xmm2/m128 | RM | V/V | SSE4_1 | Find the minimum unsigned word in xmm2/m128 and place its value in the low word of xmm1 and its index in the second-lowest word of xmm1.
VEX.128.66.0F38.WIG 41 /r VPHMINPOSUW xmm1, xmm2/m128 | RM | V/V | AVX | Find the minimum unsigned word in xmm2/m128 and place its value in the low word of xmm1 and its index in the second-lowest word of xmm1.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA


Description 

Determine the minimum unsigned word value in the source operand (second operand) and place the unsigned 
word in the low word (bits 0-15) of the destination operand (first operand). The word index of the minimum value 
is stored in bits 16-18 of the destination operand. The remaining upper bits of the destination are set to zero. 

128-bit Legacy SSE version: Bits (VLMAX-1:128) of the corresponding YMM destination register remain
unchanged.

VEX.128 encoded version: Bits (VLMAX-1:128) of the destination YMM register are zeroed. VEX.vvvv is reserved
and must be 1111b, VEX.L must be 0, otherwise the instruction will #UD.

Operation 

PHMINPOSUW (128-bit Legacy SSE version)

INDEX ← 0;
MIN ← SRC[15:0]
IF (SRC[31:16] < MIN)
    THEN INDEX ← 1; MIN ← SRC[31:16]; FI;
IF (SRC[47:32] < MIN)
    THEN INDEX ← 2; MIN ← SRC[47:32]; FI;
* Repeat operation for words 3 through 6
IF (SRC[127:112] < MIN)
    THEN INDEX ← 7; MIN ← SRC[127:112]; FI;
DEST[15:0] ← MIN;
DEST[18:16] ← INDEX;
DEST[127:19] ← 0000000000000000000000000000H;

VPHMINPOSUW (VEX.128 encoded version)

INDEX ← 0
MIN ← SRC[15:0]
IF (SRC[31:16] < MIN) THEN INDEX ← 1; MIN ← SRC[31:16]
IF (SRC[47:32] < MIN) THEN INDEX ← 2; MIN ← SRC[47:32]
* Repeat operation for words 3 through 6
IF (SRC[127:112] < MIN) THEN INDEX ← 7; MIN ← SRC[127:112]
DEST[15:0] ← MIN
DEST[18:16] ← INDEX
DEST[127:19] ← 0000000000000000000000000000H
DEST[VLMAX-1:128] ← 0
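
As a rough illustration of the pseudocode above, the minimum search can be written as a scalar loop. This is a sketch under stated assumptions rather than a reference implementation: the function name phminposuw is hypothetical, and the 128-bit operands are assumed to be modeled as arrays of eight unsigned words with index 0 holding bits 15:0.

```c
#include <stdint.h>

/* Hypothetical scalar model of PHMINPOSUW. */
static void phminposuw(uint16_t dst[8], const uint16_t src[8]) {
    uint16_t min = src[0];
    uint16_t index = 0;
    /* Strict less-than keeps the lowest index when the minimum repeats. */
    for (uint16_t i = 1; i < 8; i++) {
        if (src[i] < min) {
            min = src[i];
            index = i;
        }
    }
    dst[0] = min;    /* DEST[15:0]  <- MIN */
    dst[1] = index;  /* DEST[18:16] <- INDEX, DEST[31:19] <- 0 */
    for (int i = 2; i < 8; i++) dst[i] = 0;  /* DEST[127:32] <- 0 */
}
```

Because the comparison is strict, a tie between equal minimum values resolves to the lowest word index, matching the order of the IF statements in the pseudocode.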

Intel C/C++ Compiler Intrinsic Equivalent 

PHMINPOSUW: __m128i _mm_minpos_epu16 (__m128i packed_words);

Flags Affected 

None. 

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

See Exceptions Type 4; additionally 
#UD If VEX.L = 1.
    If VEX.vvvv ≠ 1111B.

PHSUBW/PHSUBD — Packed Horizontal Subtract


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F 38 05 /r¹ PHSUBW mm1, mm2/m64 | RM | V/V | SSSE3 | Subtract 16-bit signed integers horizontally, pack to mm1.
66 0F 38 05 /r PHSUBW xmm1, xmm2/m128 | RM | V/V | SSSE3 | Subtract 16-bit signed integers horizontally, pack to xmm1.
0F 38 06 /r PHSUBD mm1, mm2/m64 | RM | V/V | SSSE3 | Subtract 32-bit signed integers horizontally, pack to mm1.
66 0F 38 06 /r PHSUBD xmm1, xmm2/m128 | RM | V/V | SSSE3 | Subtract 32-bit signed integers horizontally, pack to xmm1.
VEX.NDS.128.66.0F38.WIG 05 /r VPHSUBW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Subtract 16-bit signed integers horizontally, pack to xmm1.
VEX.NDS.128.66.0F38.WIG 06 /r VPHSUBD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Subtract 32-bit signed integers horizontally, pack to xmm1.
VEX.NDS.256.66.0F38.WIG 05 /r VPHSUBW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Subtract 16-bit signed integers horizontally, pack to ymm1.
VEX.NDS.256.66.0F38.WIG 06 /r VPHSUBD ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Subtract 32-bit signed integers horizontally, pack to ymm1.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers"
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

(V)PHSUBW performs horizontal subtraction on each adjacent pair of 16-bit signed integers by subtracting the
most significant word from the least significant word of each pair in the source and destination operands, and packs
the signed 16-bit results to the destination operand (first operand). (V)PHSUBD performs horizontal subtraction on
each adjacent pair of 32-bit signed integers by subtracting the most significant doubleword from the least
significant doubleword of each pair, and packs the signed 32-bit result to the destination operand. When the source
operand is a 128-bit memory operand, the operand must be aligned on a 16-byte boundary or a general-protection
exception (#GP) will be generated.

Legacy SSE version: Both operands can be MMX registers. The second source operand can be an MMX register or a 
64-bit memory location. 

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the corresponding YMM
destination register remain unchanged.

In 64-bit mode, use the REX prefix to access additional registers.

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the destination YMM register are
zeroed.

VEX.256 encoded version: The first source and destination operands are YMM registers. The second source
operand can be a YMM register or a 256-bit memory location.

Note: VEX.L must be 0, otherwise the instruction will #UD. 

Operation 

PHSUBW (with 64-bit operands) 

mm1[15-0] = mm1[15-0] - mm1[31-16];
mm1[31-16] = mm1[47-32] - mm1[63-48];
mm1[47-32] = mm2/m64[15-0] - mm2/m64[31-16];
mm1[63-48] = mm2/m64[47-32] - mm2/m64[63-48];

PHSUBW (with 128-bit operands)

xmm1[15-0] = xmm1[15-0] - xmm1[31-16];
xmm1[31-16] = xmm1[47-32] - xmm1[63-48];
xmm1[47-32] = xmm1[79-64] - xmm1[95-80];
xmm1[63-48] = xmm1[111-96] - xmm1[127-112];
xmm1[79-64] = xmm2/m128[15-0] - xmm2/m128[31-16];
xmm1[95-80] = xmm2/m128[47-32] - xmm2/m128[63-48];
xmm1[111-96] = xmm2/m128[79-64] - xmm2/m128[95-80];
xmm1[127-112] = xmm2/m128[111-96] - xmm2/m128[127-112];

VPHSUBW (VEX.128 encoded version) 

DEST[15:0] ← SRC1[15:0] - SRC1[31:16]
DEST[31:16] ← SRC1[47:32] - SRC1[63:48]
DEST[47:32] ← SRC1[79:64] - SRC1[95:80]
DEST[63:48] ← SRC1[111:96] - SRC1[127:112]
DEST[79:64] ← SRC2[15:0] - SRC2[31:16]
DEST[95:80] ← SRC2[47:32] - SRC2[63:48]
DEST[111:96] ← SRC2[79:64] - SRC2[95:80]
DEST[127:112] ← SRC2[111:96] - SRC2[127:112]
DEST[VLMAX-1:128] ← 0

VPHSUBW (VEX.256 encoded version)

DEST[15:0] ← SRC1[15:0] - SRC1[31:16]
DEST[31:16] ← SRC1[47:32] - SRC1[63:48]
DEST[47:32] ← SRC1[79:64] - SRC1[95:80]
DEST[63:48] ← SRC1[111:96] - SRC1[127:112]
DEST[79:64] ← SRC2[15:0] - SRC2[31:16]
DEST[95:80] ← SRC2[47:32] - SRC2[63:48]
DEST[111:96] ← SRC2[79:64] - SRC2[95:80]
DEST[127:112] ← SRC2[111:96] - SRC2[127:112]
DEST[143:128] ← SRC1[143:128] - SRC1[159:144]
DEST[159:144] ← SRC1[175:160] - SRC1[191:176]
DEST[175:160] ← SRC1[207:192] - SRC1[223:208]
DEST[191:176] ← SRC1[239:224] - SRC1[255:240]
DEST[207:192] ← SRC2[143:128] - SRC2[159:144]
DEST[223:208] ← SRC2[175:160] - SRC2[191:176]
DEST[239:224] ← SRC2[207:192] - SRC2[223:208]
DEST[255:240] ← SRC2[239:224] - SRC2[255:240]

PHSUBD (with 64-bit operands) 

mm1[31-0] = mm1[31-0] - mm1[63-32];
mm1[63-32] = mm2/m64[31-0] - mm2/m64[63-32];

PHSUBD (with 128-bit operands)

xmm1[31-0] = xmm1[31-0] - xmm1[63-32];
xmm1[63-32] = xmm1[95-64] - xmm1[127-96];
xmm1[95-64] = xmm2/m128[31-0] - xmm2/m128[63-32];
xmm1[127-96] = xmm2/m128[95-64] - xmm2/m128[127-96];

VPHSUBD (VEX.128 encoded version)

DEST[31-0] ← SRC1[31-0] - SRC1[63-32]
DEST[63-32] ← SRC1[95-64] - SRC1[127-96]
DEST[95-64] ← SRC2[31-0] - SRC2[63-32]
DEST[127-96] ← SRC2[95-64] - SRC2[127-96]
DEST[VLMAX-1:128] ← 0

VPHSUBD (VEX.256 encoded version)

DEST[31:0] ← SRC1[31:0] - SRC1[63:32]
DEST[63:32] ← SRC1[95:64] - SRC1[127:96]
DEST[95:64] ← SRC2[31:0] - SRC2[63:32]
DEST[127:96] ← SRC2[95:64] - SRC2[127:96]
DEST[159:128] ← SRC1[159:128] - SRC1[191:160]
DEST[191:160] ← SRC1[223:192] - SRC1[255:224]
DEST[223:192] ← SRC2[159:128] - SRC2[191:160]
DEST[255:224] ← SRC2[223:192] - SRC2[255:224]
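
The 128-bit forms above can be sketched as scalar loops. The code below is an illustrative model only; the names phsubw_128 and phsubd_128 are hypothetical, and lanes are assumed to be stored as raw bit patterns in unsigned arrays (index 0 is the lowest element) so that the wraparound of the non-saturating subtraction is well defined in C.

```c
#include <stdint.h>

/* Hypothetical scalar model of 128-bit PHSUBW: low element minus high
   element of each adjacent pair, results wrap modulo 2^16. */
static void phsubw_128(uint16_t dst[8], const uint16_t src[8]) {
    uint16_t out[8];
    for (int i = 0; i < 4; i++) {
        out[i] = (uint16_t)(dst[2 * i] - dst[2 * i + 1]);
        out[i + 4] = (uint16_t)(src[2 * i] - src[2 * i + 1]);
    }
    for (int i = 0; i < 8; i++) dst[i] = out[i];
}

/* Hypothetical scalar model of 128-bit PHSUBD: same pattern on doublewords,
   results wrap modulo 2^32. */
static void phsubd_128(uint32_t dst[4], const uint32_t src[4]) {
    uint32_t out[4];
    for (int i = 0; i < 2; i++) {
        out[i] = dst[2 * i] - dst[2 * i + 1];
        out[i + 2] = src[2 * i] - src[2 * i + 1];
    }
    for (int i = 0; i < 4; i++) dst[i] = out[i];
}
```

Note that a pair such as (8000H, 0001H) yields 7FFFH here, since these instructions do not saturate.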


Intel C/C++ Compiler Intrinsic Equivalents

PHSUBW: __m64 _mm_hsub_pi16 (__m64 a, __m64 b)

PHSUBD: __m64 _mm_hsub_pi32 (__m64 a, __m64 b)

(V)PHSUBW: __m128i _mm_hsub_epi16 (__m128i a, __m128i b)

(V)PHSUBD: __m128i _mm_hsub_epi32 (__m128i a, __m128i b)

VPHSUBW: __m256i _mm256_hsub_epi16 (__m256i a, __m256i b)

VPHSUBD: __m256i _mm256_hsub_epi32 (__m256i a, __m256i b)


SIMD Floating-Point Exceptions 

None. 


Other Exceptions 

See Exceptions Type 4; additionally 
#UD If VEX.L = 1.

PHSUBSW — Packed Horizontal Subtract and Saturate


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F 38 07 /r¹ PHSUBSW mm1, mm2/m64 | RM | V/V | SSSE3 | Subtract 16-bit signed integer horizontally, pack saturated integers to mm1.
66 0F 38 07 /r PHSUBSW xmm1, xmm2/m128 | RM | V/V | SSSE3 | Subtract 16-bit signed integer horizontally, pack saturated integers to xmm1.
VEX.NDS.128.66.0F38.WIG 07 /r VPHSUBSW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Subtract 16-bit signed integer horizontally, pack saturated integers to xmm1.
VEX.NDS.256.66.0F38.WIG 07 /r VPHSUBSW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Subtract 16-bit signed integer horizontally, pack saturated integers to ymm1.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers"
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (r, w) | VEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

(V)PHSUBSW performs horizontal subtraction on each adjacent pair of 16-bit signed integers by subtracting the 
most significant word from the least significant word of each pair in the source and destination operands. The 
signed, saturated 16-bit results are packed to the destination operand (first operand). When the source operand is 
a 128-bit memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception 
(#GP) will be generated. 

Legacy SSE version: Both operands can be MMX registers. The second source operand can be an MMX register or 
a 64-bit memory location. 

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the corresponding YMM
destination register remain unchanged.

In 64-bit mode, use the REX prefix to access additional registers.

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the destination YMM register are
zeroed.

VEX.256 encoded version: The first source and destination operands are YMM registers. The second source
operand can be a YMM register or a 256-bit memory location.

Note: VEX.L must be 0, otherwise the instruction will #UD. 

Operation 

PHSUBSW (with 64-bit operands) 

mm1[15-0] = SaturateToSignedWord(mm1[15-0] - mm1[31-16]);
mm1[31-16] = SaturateToSignedWord(mm1[47-32] - mm1[63-48]);
mm1[47-32] = SaturateToSignedWord(mm2/m64[15-0] - mm2/m64[31-16]);
mm1[63-48] = SaturateToSignedWord(mm2/m64[47-32] - mm2/m64[63-48]);

PHSUBSW (with 128-bit operands)

xmm1[15-0] = SaturateToSignedWord(xmm1[15-0] - xmm1[31-16]);
xmm1[31-16] = SaturateToSignedWord(xmm1[47-32] - xmm1[63-48]);
xmm1[47-32] = SaturateToSignedWord(xmm1[79-64] - xmm1[95-80]);
xmm1[63-48] = SaturateToSignedWord(xmm1[111-96] - xmm1[127-112]);
xmm1[79-64] = SaturateToSignedWord(xmm2/m128[15-0] - xmm2/m128[31-16]);
xmm1[95-80] = SaturateToSignedWord(xmm2/m128[47-32] - xmm2/m128[63-48]);
xmm1[111-96] = SaturateToSignedWord(xmm2/m128[79-64] - xmm2/m128[95-80]);
xmm1[127-112] = SaturateToSignedWord(xmm2/m128[111-96] - xmm2/m128[127-112]);

VPHSUBSW (VEX.128 encoded version) 

DEST[15:0]= SaturateToSignedWord(SRC1 [15:0] - SRC1 [31:16]) 

DEST[31:16] = SaturateToSignedWord(SRC1 [47:32] - SRC1 [63:48]) 

DEST[47:32] = SaturateToSignedWord(SRC1 [79:64] - SRC1 [95:80]) 

DEST[63:48] = SaturateToSignedWord(SRC1 [111:96] - SRC1 [127:112]) 

DEST[79:64] = SaturateToSignedWord(SRC2[15:0] - SRC2[31:16]) 

DEST[95:80] = SaturateToSignedWord(SRC2[47:32] - SRC2[63:48]) 

DEST[111:96] = SaturateToSignedWord(SRC2[79:64] - SRC2[95:80]) 

DEST[127:112] = SaturateToSignedWord(SRC2[111:96] - SRC2[127:112]) 
DEST[VLMAX-1:128] ← 0

VPHSUBSW (VEX.256 encoded version) 

DEST[15:0]= SaturateToSignedWord(SRC1 [15:0] - SRC1 [31:16]) 

DEST[31:16] = SaturateToSignedWord(SRC1 [47:32] - SRC1 [63:48]) 

DEST[47:32] = SaturateToSignedWord(SRC1 [79:64] - SRC1 [95:80]) 

DEST[63:48] = SaturateToSignedWord(SRC1 [111:96] - SRC1 [127:112]) 

DEST[79:64] = SaturateToSignedWord(SRC2[15:0] - SRC2[31:16]) 

DEST[95:80] = SaturateToSignedWord(SRC2[47:32] - SRC2[63:48]) 

DEST[111:96] = SaturateToSignedWord(SRC2[79:64] - SRC2[95:80]) 

DEST[127:112] = SaturateToSignedWord(SRC2[111:96] - SRC2[127:112]) 

DEST[143:128]= SaturateToSignedWord(SRC1 [143:128] - SRC1 [159:144]) 

DEST[159:144] = SaturateToSignedWord(SRC1 [175:160] - SRC1 [191:176]) 

DEST[175:160] = SaturateToSignedWord(SRC1 [207:192] - SRC1 [223:208]) 

DEST[191:176] = SaturateToSignedWord(SRC1 [239:224] - SRC1 [255:240]) 

DEST[207:192] = SaturateToSignedWord(SRC2[143:128] - SRC2[159:144]) 

DEST[223:208] = SaturateToSignedWord(SRC2[175:160] - SRC2[191:176]) 

DEST[239:224] = SaturateToSignedWord(SRC2[207:192] - SRC2[223:208]) 

DEST[255:240] = SaturateToSignedWord(SRC2[239:224] - SRC2[255:240]) 
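
To make the effect of saturation concrete, the 64-bit form can be modeled in scalar C. This is a minimal sketch, assuming the MMX operands are represented as arrays of four signed words (index 0 is bits 15:0); the names sat16 and phsubsw_64 are hypothetical.

```c
#include <stdint.h>

/* Clamp a 32-bit intermediate difference to the signed 16-bit range. */
static int16_t sat16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Hypothetical scalar model of PHSUBSW with 64-bit operands. */
static void phsubsw_64(int16_t dst[4], const int16_t src[4]) {
    int16_t out[4];
    out[0] = sat16((int32_t)dst[0] - dst[1]);  /* mm1[15-0]  */
    out[1] = sat16((int32_t)dst[2] - dst[3]);  /* mm1[31-16] */
    out[2] = sat16((int32_t)src[0] - src[1]);  /* mm1[47-32] */
    out[3] = sat16((int32_t)src[2] - src[3]);  /* mm1[63-48] */
    for (int i = 0; i < 4; i++) dst[i] = out[i];
}
```

Where PHSUBW would wrap, this model clamps: for example, subtracting 1 from -32768 leaves -32768.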

Intel C/C++ Compiler Intrinsic Equivalent 

PHSUBSW: __m64 _mm_hsubs_pi16 (__m64 a, __m64 b)

(V)PHSUBSW: __m128i _mm_hsubs_epi16 (__m128i a, __m128i b)

VPHSUBSW: __m256i _mm256_hsubs_epi16 (__m256i a, __m256i b)

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

See Exceptions Type 4; additionally 
#UD If VEX.L = 1.

PINSRB/PINSRD/PINSRQ — Insert Byte/Dword/Qword


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 3A 20 /r ib PINSRB xmm1, r32/m8, imm8 | RMI | V/V | SSE4_1 | Insert a byte integer value from r32/m8 into xmm1 at the destination element in xmm1 specified by imm8.
66 0F 3A 22 /r ib PINSRD xmm1, r/m32, imm8 | RMI | V/V | SSE4_1 | Insert a dword integer value from r/m32 into the xmm1 at the destination element specified by imm8.
66 REX.W 0F 3A 22 /r ib PINSRQ xmm1, r/m64, imm8 | RMI | V/N.E. | SSE4_1 | Insert a qword integer value from r/m64 into the xmm1 at the destination element specified by imm8.
VEX.NDS.128.66.0F3A.W0 20 /r ib VPINSRB xmm1, xmm2, r32/m8, imm8 | RVMI | V¹/V | AVX | Merge a byte integer value from r32/m8 and rest from xmm2 into xmm1 at the byte offset in imm8.
VEX.NDS.128.66.0F3A.W0 22 /r ib VPINSRD xmm1, xmm2, r/m32, imm8 | RVMI | V/V | AVX | Insert a dword integer value from r32/m32 and rest from xmm2 into xmm1 at the dword offset in imm8.
VEX.NDS.128.66.0F3A.W1 22 /r ib VPINSRQ xmm1, xmm2, r/m64, imm8 | RVMI | V/I | AVX | Insert a qword integer value from r64/m64 and rest from xmm2 into xmm1 at the qword offset in imm8.
EVEX.NDS.128.66.0F3A.WIG 20 /r ib VPINSRB xmm1, xmm2, r32/m8, imm8 | T1S-RVMI | V/V | AVX512BW | Merge a byte integer value from r32/m8 and rest from xmm2 into xmm1 at the byte offset in imm8.
EVEX.NDS.128.66.0F3A.W0 22 /r ib VPINSRD xmm1, xmm2, r32/m32, imm8 | T1S-RVMI | V/V | AVX512DQ | Insert a dword integer value from r32/m32 and rest from xmm2 into xmm1 at the dword offset in imm8.
EVEX.NDS.128.66.0F3A.W1 22 /r ib VPINSRQ xmm1, xmm2, r64/m64, imm8 | T1S-RVMI | V/N.E.¹ | AVX512DQ | Insert a qword integer value from r64/m64 and rest from xmm2 into xmm1 at the qword offset in imm8.


NOTES: 

1. In 64-bit mode, VEX.W1 is ignored for VPINSRB (similar to legacy REX.W=1 prefix with PINSRB).


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RMI | ModRM:reg (w) | ModRM:r/m (r) | imm8 | NA
RVMI | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | imm8
T1S-RVMI | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | imm8


Description 

Copies a byte/dword/qword from the source operand (second operand) and inserts it in the destination operand
(first operand) at the location specified with the count operand (third operand). (The other elements in the
destination register are left untouched.) The source operand can be a general-purpose register or a memory location.
(When the source operand is a general-purpose register, PINSRB copies the low byte of the register.) The
destination operand is an XMM register. The count operand is an 8-bit immediate. When specifying a qword [dword, byte]
location in an XMM register, the 1 [2, 4] least-significant bit(s) of the count operand specify the location.

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15, R8-15). Use of REX.W permits the use of 64-bit general-purpose
registers.

128-bit Legacy SSE version: Bits (VLMAX-1:128) of the corresponding YMM destination register remain
unchanged.

VEX.128 encoded version: Bits (VLMAX-1:128) of the destination register are zeroed. VEX.L must be 0, otherwise
the instruction will #UD. Attempt to execute VPINSRQ in non-64-bit mode will cause #UD.

EVEX.128 encoded version: Bits (VLMAX-1:128) of the destination register are zeroed. EVEX.L'L must be 0,
otherwise the instruction will #UD.

Operation 

CASE OF
    PINSRB: SEL ← COUNT[3:0];
            MASK ← (0FFH << (SEL * 8));
            TEMP ← ((SRC[7:0] << (SEL * 8)) AND MASK);
    PINSRD: SEL ← COUNT[1:0];
            MASK ← (0FFFFFFFFH << (SEL * 32));
            TEMP ← ((SRC << (SEL * 32)) AND MASK);
    PINSRQ: SEL ← COUNT[0];
            MASK ← (0FFFFFFFFFFFFFFFFH << (SEL * 64));
            TEMP ← ((SRC << (SEL * 64)) AND MASK);
ESAC;
DEST ← ((DEST AND NOT MASK) OR TEMP);

VPINSRB (VEX/EVEX encoded version)

SEL ← imm8[3:0]
DEST[127:0] ← write_b_element(SEL, SRC2, SRC1)
DEST[VLMAX-1:128] ← 0

VPINSRD (VEX/EVEX encoded version)

SEL ← imm8[1:0]
DEST[127:0] ← write_d_element(SEL, SRC2, SRC1)
DEST[VLMAX-1:128] ← 0

VPINSRQ (VEX/EVEX encoded version)

SEL ← imm8[0]
DEST[127:0] ← write_q_element(SEL, SRC2, SRC1)
DEST[VLMAX-1:128] ← 0
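
The element selection above can be illustrated with a byte-array model. The following is a sketch only, assuming the XMM register is modeled as 16 bytes in little-endian order; the function name pinsr and its size parameter (1, 4 or 8 bytes for PINSRB, PINSRD, PINSRQ) are hypothetical and not part of any real API.

```c
#include <stdint.h>

/* Hypothetical scalar model of PINSRB/PINSRD/PINSRQ.
   reg:   16-byte register image, reg[0] is bits 7:0
   value: source operand; only the low `size` bytes are used
   size:  element size in bytes (1, 4 or 8)
   count: the imm8 operand; only its low 4, 2 or 1 bit(s) select the slot */
static void pinsr(uint8_t reg[16], uint64_t value, unsigned size, unsigned count) {
    /* 16/size elements fit in the register, so the selector mask is
       15 for bytes, 3 for dwords and 1 for qwords. */
    unsigned sel = count & (16u / size - 1u);
    for (unsigned b = 0; b < size; b++) {
        reg[sel * size + b] = (uint8_t)(value >> (8 * b));
    }
    /* All other bytes are left untouched, matching DEST AND NOT MASK. */
}
```

Writing the selected bytes in place is equivalent to the MASK/TEMP formulation in the pseudocode; higher bits of the count operand are simply ignored.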

Intel C/C++ Compiler Intrinsic Equivalent 

PINSRB: __m128i _mm_insert_epi8 (__m128i s1, int s2, const int ndx);

PINSRD: __m128i _mm_insert_epi32 (__m128i s2, int s, const int ndx);

PINSRQ: __m128i _mm_insert_epi64 (__m128i s2, __int64 s, const int ndx);

Flags Affected 

None. 

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 5;
EVEX-encoded instruction, see Exceptions Type E9NF.

#UD If VEX.L = 1 or EVEX.L'L > 0.
    If VPINSRQ in non-64-bit mode with VEX.W=1.

PINSRW — Insert Word


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F C4 /r ib¹ PINSRW mm, r32/m16, imm8 | RMI | V/V | SSE | Insert the low word from r32 or from m16 into mm at the word position specified by imm8.
66 0F C4 /r ib PINSRW xmm, r32/m16, imm8 | RMI | V/V | SSE2 | Move the low word of r32 or from m16 into xmm at the word position specified by imm8.
VEX.NDS.128.66.0F.W0 C4 /r ib VPINSRW xmm1, xmm2, r32/m16, imm8 | RVMI | V²/V | AVX | Insert a word integer value from r32/m16 and rest from xmm2 into xmm1 at the word offset in imm8.
EVEX.NDS.128.66.0F.WIG C4 /r ib VPINSRW xmm1, xmm2, r32/m16, imm8 | T1S-RVMI | V/V | AVX512BW | Insert a word integer value from r32/m16 and rest from xmm2 into xmm1 at the word offset in imm8.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers"
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.

2. In 64-bit mode, VEX.W1 is ignored for VPINSRW (similar to legacy REX.W=1 prefix in PINSRW).


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RMI | ModRM:reg (w) | ModRM:r/m (r) | imm8 | NA
RVMI | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | imm8
T1S-RVMI | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | imm8


Description 

Copies a word from the source operand (second operand) and inserts it in the destination operand (first operand) 
at the location specified with the count operand (third operand). (The other words in the destination register are 
left untouched.) The source operand can be a general-purpose register or a 16-bit memory location. (When the 
source operand is a general-purpose register, the low word of the register is copied.) The destination operand can 
be an MMX technology register or an XMM register. The count operand is an 8-bit immediate. When specifying a 
word location in an MMX technology register, the 2 least-significant bits of the count operand specify the location; 
for an XMM register, the 3 least-significant bits specify the location. 

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15, R8-15). 

128-bit Legacy SSE version: Bits (VLMAX-1:128) of the corresponding YMM destination register remain
unchanged.

VEX.128 encoded version: Bits (VLMAX-1:128) of the destination YMM register are zeroed. VEX.L must be 0,
otherwise the instruction will #UD.

EVEX.128 encoded version: Bits (VLMAX-1:128) of the destination register are zeroed. EVEX.L'L must be 0,
otherwise the instruction will #UD.

Operation 

PINSRW (with 64-bit source operand)

SEL ← COUNT AND 3H;
CASE (Determine word position) OF
    SEL ← 0: MASK ← 000000000000FFFFH;
    SEL ← 1: MASK ← 00000000FFFF0000H;
    SEL ← 2: MASK ← 0000FFFF00000000H;
    SEL ← 3: MASK ← FFFF000000000000H;
DEST ← (DEST AND NOT MASK) OR ((SRC << (SEL * 16)) AND MASK);

PINSRW (with 128-bit source operand)

SEL ← COUNT AND 7H;
CASE (Determine word position) OF
    SEL ← 0: MASK ← 0000000000000000000000000000FFFFH;
    SEL ← 1: MASK ← 000000000000000000000000FFFF0000H;
    SEL ← 2: MASK ← 00000000000000000000FFFF00000000H;
    SEL ← 3: MASK ← 0000000000000000FFFF000000000000H;
    SEL ← 4: MASK ← 000000000000FFFF0000000000000000H;
    SEL ← 5: MASK ← 00000000FFFF00000000000000000000H;
    SEL ← 6: MASK ← 0000FFFF000000000000000000000000H;
    SEL ← 7: MASK ← FFFF0000000000000000000000000000H;
DEST ← (DEST AND NOT MASK) OR ((SRC << (SEL * 16)) AND MASK);

VPINSRW (VEX/EVEX encoded version)

SEL ← imm8[2:0]
DEST[127:0] ← write_w_element(SEL, SRC2, SRC1)
DEST[VLMAX-1:128] ← 0
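
The mask arithmetic of the 64-bit form translates almost directly into C. The following is a minimal sketch, assuming the MMX register is modeled as a uint64_t; the function name pinsrw_mmx is hypothetical.

```c
#include <stdint.h>

/* Hypothetical scalar model of PINSRW with a 64-bit (MMX) destination.
   dest:  current register contents
   src:   r32 source; only the low word survives the masking
   count: the imm8 operand; only bits 1:0 are used */
static uint64_t pinsrw_mmx(uint64_t dest, uint32_t src, uint8_t count) {
    unsigned sel = count & 3u;                       /* SEL <- COUNT AND 3H */
    uint64_t mask = (uint64_t)0xFFFF << (sel * 16);  /* one word-wide mask  */
    return (dest & ~mask) | (((uint64_t)src << (sel * 16)) & mask);
}
```

Computing the mask by shifting 0FFFFH is equivalent to the CASE table in the pseudocode. The source is widened to 64 bits before shifting so that a selector of 2 or 3 does not shift bits out of a 32-bit value.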

Intel C/C++ Compiler Intrinsic Equivalent 

PINSRW: __m64 _mm_insert_pi16 (__m64 a, int d, int n)

PINSRW: __m128i _mm_insert_epi16 (__m128i a, int b, int imm)

Flags Affected 

None. 

Numeric Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 5;
EVEX-encoded instruction, see Exceptions Type E9NF.

#UD If VEX.L = 1 or EVEX.L'L > 0.

PMADDUBSW — Multiply and Add Packed Signed and Unsigned Bytes


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F 38 04 /r¹ PMADDUBSW mm1, mm2/m64 | RM | V/V | SSSE3 | Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed-words to mm1.
66 0F 38 04 /r PMADDUBSW xmm1, xmm2/m128 | RM | V/V | SSSE3 | Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed-words to xmm1.
VEX.NDS.128.66.0F38.WIG 04 /r VPMADDUBSW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed-words to xmm1.
VEX.NDS.256.66.0F38.WIG 04 /r VPMADDUBSW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed-words to ymm1.
EVEX.NDS.128.66.0F38.WIG 04 /r VPMADDUBSW xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed-words to xmm1 under writemask k1.
EVEX.NDS.256.66.0F38.WIG 04 /r VPMADDUBSW ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed-words to ymm1 under writemask k1.
EVEX.NDS.512.66.0F38.WIG 04 /r VPMADDUBSW zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed-words to zmm1 under writemask k1.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers"
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

(V)PMADDUBSW multiplies vertically each unsigned byte of the destination operand (first operand) with the
corresponding signed byte of the source operand (second operand), producing intermediate signed 16-bit integers. Each
adjacent pair of signed words is added and the saturated result is packed to the destination operand. For example,
the lowest-order bytes (bits 7-0) in the source and destination operands are multiplied and the intermediate signed
word result is added with the corresponding intermediate result from the 2nd lowest-order bytes (bits 15-8) of the
operands; the sign-saturated result is stored in the lowest word of the destination register (15-0). The same
operation is performed on the other pairs of adjacent bytes. Both operands can be MMX registers or XMM registers. When
the source operand is a 128-bit memory operand, the operand must be aligned on a 16-byte boundary or a
general-protection exception (#GP) will be generated.

In 64-bit mode and not encoded with VEX/EVEX, use the REX prefix to access XMM8-XMM15. 

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding destination
register remain unchanged.

VEX.128 and EVEX.128 encoded versions: The first source and destination operands are XMM registers. The
second source operand is an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the
corresponding destination register are zeroed.

VEX.256 and EVEX.256 encoded versions: The second source operand can be a YMM register or a 256-bit memory
location. The first source and destination operands are YMM registers. Bits (MAX_VL-1:256) of the corresponding
ZMM register are zeroed.

EVEX.512 encoded version: The second source operand can be an ZMM register or a 512-bit memory location. The 
first source and destination operands are ZMM registers. 

Operation 

PMADDUBSW (with 64 bit operands) 

DEST[15-0] = SaturateToSignedWord(SRC[15-8]*DEST[15-8]+SRC[7-0]*DEST[7-0]); 
DEST[31-16] = SaturateToSignedWord(SRC[31-24]*DEST[31-24]+SRC[23-16]*DEST[23-16]); 
DEST[47-32] = SaturateToSignedWord(SRC[47-40]*DEST[47-40]+SRC[39-32]*DEST[39-32]); 
DEST[63-48] = SaturateToSignedWord(SRC[63-56]*DEST[63-56]+SRC[55-48]*DEST[55-48]); 

PMADDUBSW (with 128 bit operands) 

DEST[15-0] = SaturateToSignedWord(SRC[15-8]*DEST[15-8]+SRC[7-0]*DEST[7-0]); 
// Repeat operation for 2nd through 7th word 
SRC1/DEST[127-112] = SaturateToSignedWord(SRC[127-120]*DEST[127-120]+SRC[119-112]*DEST[119-112]); 

VPMADDUBSW (VEX.128 encoded version) 

DEST[15:0] ← SaturateToSignedWord(SRC2[15:8]*SRC1[15:8]+SRC2[7:0]*SRC1[7:0]) 
// Repeat operation for 2nd through 7th word 
DEST[127:112] ← SaturateToSignedWord(SRC2[127:120]*SRC1[127:120]+SRC2[119:112]*SRC1[119:112]) 
DEST[VLMAX-1:128] ← 0 

VPMADDUBSW (VEX.256 encoded version) 

DEST[15:0] ← SaturateToSignedWord(SRC2[15:8]*SRC1[15:8]+SRC2[7:0]*SRC1[7:0]) 
// Repeat operation for 2nd through 15th word 
DEST[255:240] ← SaturateToSignedWord(SRC2[255:248]*SRC1[255:248]+SRC2[247:240]*SRC1[247:240]) 
DEST[VLMAX-1:256] ← 0 

VPMADDUBSW (EVEX encoded versions) 

(KL, VL) = (8, 128), (16, 256), (32, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 16 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+15:i] ← SaturateToSignedWord(SRC2[i+15:i+8]*SRC1[i+15:i+8] + SRC2[i+7:i]*SRC1[i+7:i]) 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+15:i] remains unchanged* 
                ELSE *zeroing-masking* ; zeroing-masking 
                    DEST[i+15:i] = 0 
            FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 
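
The per-word computation in the pseudocode above can be modeled in scalar C. The following is a minimal sketch, not part of the instruction definition. It assumes that the first operand supplies unsigned bytes and the second operand supplies signed bytes, which is how this instruction treats its inputs. The names saturate_to_signed_word and pmaddubsw_model are hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 32-bit intermediate value to the signed 16-bit range. */
int16_t saturate_to_signed_word(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Scalar model of PMADDUBSW producing n_words result words.
   a: bytes of the first operand, treated as unsigned (2 * n_words bytes)
   b: bytes of the second operand, treated as signed (2 * n_words bytes) */
void pmaddubsw_model(const uint8_t *a, const int8_t *b,
                     int16_t *dst, int n_words)
{
    assert(n_words >= 0);
    for (int j = 0; j < n_words; j++) {
        /* Each product fits in 17 bits, so int32_t holds the sum exactly. */
        int32_t lo = (int32_t)a[2 * j]     * (int32_t)b[2 * j];
        int32_t hi = (int32_t)a[2 * j + 1] * (int32_t)b[2 * j + 1];
        dst[j] = saturate_to_signed_word(lo + hi);
    }
}
```

With n_words set to 4, 8, 16 or 32 the sketch covers the 64-bit, 128-bit, 256-bit and 512-bit forms without masking.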

Intel C/C++ Compiler Intrinsic Equivalents 

VPMADDUBSW __m512i _mm512_maddubs_epi16( __m512i a, __m512i b); 
VPMADDUBSW __m512i _mm512_mask_maddubs_epi16( __m512i s, __mmask32 k, __m512i a, __m512i b); 
VPMADDUBSW __m512i _mm512_maskz_maddubs_epi16( __mmask32 k, __m512i a, __m512i b); 
VPMADDUBSW __m256i _mm256_mask_maddubs_epi16( __m256i s, __mmask16 k, __m256i a, __m256i b); 
VPMADDUBSW __m256i _mm256_maskz_maddubs_epi16( __mmask16 k, __m256i a, __m256i b); 
VPMADDUBSW __m128i _mm_mask_maddubs_epi16( __m128i s, __mmask8 k, __m128i a, __m128i b); 
VPMADDUBSW __m128i _mm_maskz_maddubs_epi16( __mmask8 k, __m128i a, __m128i b); 
PMADDUBSW: __m64 _mm_maddubs_pi16 (__m64 a, __m64 b) 
(V)PMADDUBSW: __m128i _mm_maddubs_epi16 (__m128i a, __m128i b) 
VPMADDUBSW: __m256i _mm256_maddubs_epi16 (__m256i a, __m256i b) 

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded instruction, see Exceptions Type E4NF.nb. 

PMADDWD—Multiply and Add Packed Integers 


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description 
0F F5 /r¹ PMADDWD mm, mm/m64 | RM | V/V | MMX | Multiply the packed words in mm by the packed words in mm/m64, add adjacent doubleword results, and store in mm. 
66 0F F5 /r PMADDWD xmm1, xmm2/m128 | RM | V/V | SSE2 | Multiply the packed word integers in xmm1 by the packed word integers in xmm2/m128, add adjacent doubleword results, and store in xmm1. 
VEX.NDS.128.66.0F.WIG F5 /r VPMADDWD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Multiply the packed word integers in xmm2 by the packed word integers in xmm3/m128, add adjacent doubleword results, and store in xmm1. 
VEX.NDS.256.66.0F.WIG F5 /r VPMADDWD ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Multiply the packed word integers in ymm2 by the packed word integers in ymm3/m256, add adjacent doubleword results, and store in ymm1. 
EVEX.NDS.128.66.0F.WIG F5 /r VPMADDWD xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Multiply the packed word integers in xmm2 by the packed word integers in xmm3/m128, add adjacent doubleword results, and store in xmm1 under writemask k1. 
EVEX.NDS.256.66.0F.WIG F5 /r VPMADDWD ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Multiply the packed word integers in ymm2 by the packed word integers in ymm3/m256, add adjacent doubleword results, and store in ymm1 under writemask k1. 
EVEX.NDS.512.66.0F.WIG F5 /r VPMADDWD zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Multiply the packed word integers in zmm2 by the packed word integers in zmm3/m512, add adjacent doubleword results, and store in zmm1 under writemask k1. 

NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers" 
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. 


Instruction Operand Encoding 

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA 
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA 
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA 


Description 

Multiplies the individual signed words of the destination operand (first operand) by the corresponding signed words 
of the source operand (second operand), producing temporary signed, doubleword results. The adjacent 
doubleword results are then summed and stored in the destination operand. For example, the corresponding low-order 
words (15-0) and (31-16) in the source and destination operands are multiplied by one another and the 
doubleword results are added together and stored in the low doubleword of the destination register (31-0). The same 
operation is performed on the other pairs of adjacent words. (Figure 4-11 shows this operation when using 64-bit 
operands.) 

The (V)PMADDWD instruction wraps around only in one situation: when the 2 pairs of words being operated on in 
a group are all 8000H. In this case, the result wraps around to 80000000H. 
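
The wraparound case can be checked with a scalar model of one doubleword result. This is a sketch under stated assumptions rather than a definition of the instruction: pmaddwd_lane and pmaddwd_wraparound_check are hypothetical names, and the final conversion to int32_t relies on the bit-preserving conversion that mainstream two's-complement compilers provide.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one (V)PMADDWD doubleword result: two signed
   16-bit by 16-bit products summed modulo 2^32. */
int32_t pmaddwd_lane(int16_t a_lo, int16_t a_hi, int16_t b_lo, int16_t b_hi)
{
    int32_t p_lo = (int32_t)a_lo * (int32_t)b_lo;
    int32_t p_hi = (int32_t)a_hi * (int32_t)b_hi;
    /* Add as unsigned so the sum wraps instead of overflowing a signed type. */
    uint32_t sum = (uint32_t)p_lo + (uint32_t)p_hi;
    /* Assumption: out-of-range conversion to int32_t keeps the bit pattern. */
    return (int32_t)sum;
}

/* All four words equal to 8000H: each product is 40000000H and the
   sum wraps to 80000000H. */
void pmaddwd_wraparound_check(void)
{
    int16_t w = INT16_MIN; /* 8000H */
    assert(pmaddwd_lane(w, w, w, w) == INT32_MIN); /* 80000000H */
}
```

Any other combination of inputs yields a sum that fits in a signed doubleword, which is why the text singles out this one case.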

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15). 

Legacy SSE version: The first source and destination operands are MMX registers. The second source operand is an 
MMX register or a 64-bit memory location. 

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source 
operand is an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the corresponding YMM 
destination register remain unchanged. 

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source 
operand is an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the destination YMM register are 
zeroed. 

VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The 
first source and destination operands are YMM registers. 

EVEX.512 encoded version: The second source operand can be a ZMM register or a 512-bit memory location. The 
first source and destination operands are ZMM registers. 



Figure 4-11. PMADDWD Execution Model Using 64-bit Operands 


Operation 

PMADDWD (with 64-bit operands) 

DEST[31:0] ← (DEST[15:0] * SRC[15:0]) + (DEST[31:16] * SRC[31:16]); 
DEST[63:32] ← (DEST[47:32] * SRC[47:32]) + (DEST[63:48] * SRC[63:48]); 

PMADDWD (with 128-bit operands) 

DEST[31:0] ← (DEST[15:0] * SRC[15:0]) + (DEST[31:16] * SRC[31:16]); 
DEST[63:32] ← (DEST[47:32] * SRC[47:32]) + (DEST[63:48] * SRC[63:48]); 
DEST[95:64] ← (DEST[79:64] * SRC[79:64]) + (DEST[95:80] * SRC[95:80]); 
DEST[127:96] ← (DEST[111:96] * SRC[111:96]) + (DEST[127:112] * SRC[127:112]); 


VPMADDWD (VEX.128 encoded version) 

DEST[31:0] ← (SRC1[15:0] * SRC2[15:0]) + (SRC1[31:16] * SRC2[31:16]) 
DEST[63:32] ← (SRC1[47:32] * SRC2[47:32]) + (SRC1[63:48] * SRC2[63:48]) 
DEST[95:64] ← (SRC1[79:64] * SRC2[79:64]) + (SRC1[95:80] * SRC2[95:80]) 
DEST[127:96] ← (SRC1[111:96] * SRC2[111:96]) + (SRC1[127:112] * SRC2[127:112]) 
DEST[VLMAX-1:128] ← 0 

VPMADDWD (VEX.256 encoded version) 

DEST[31:0] ← (SRC1[15:0] * SRC2[15:0]) + (SRC1[31:16] * SRC2[31:16]) 
DEST[63:32] ← (SRC1[47:32] * SRC2[47:32]) + (SRC1[63:48] * SRC2[63:48]) 
DEST[95:64] ← (SRC1[79:64] * SRC2[79:64]) + (SRC1[95:80] * SRC2[95:80]) 
DEST[127:96] ← (SRC1[111:96] * SRC2[111:96]) + (SRC1[127:112] * SRC2[127:112]) 
DEST[159:128] ← (SRC1[143:128] * SRC2[143:128]) + (SRC1[159:144] * SRC2[159:144]) 
DEST[191:160] ← (SRC1[175:160] * SRC2[175:160]) + (SRC1[191:176] * SRC2[191:176]) 
DEST[223:192] ← (SRC1[207:192] * SRC2[207:192]) + (SRC1[223:208] * SRC2[223:208]) 
DEST[255:224] ← (SRC1[239:224] * SRC2[239:224]) + (SRC1[255:240] * SRC2[255:240]) 
DEST[VLMAX-1:256] ← 0 

VPMADDWD (EVEX encoded versions) 

(KL, VL) = (4, 128), (8, 256), (16, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 32 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+31:i] ← (SRC2[i+31:i+16] * SRC1[i+31:i+16]) + (SRC2[i+15:i] * SRC1[i+15:i]) 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+31:i] remains unchanged* 
                ELSE *zeroing-masking* ; zeroing-masking 
                    DEST[i+31:i] = 0 
            FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 

Intel C/C++ Compiler Intrinsic Equivalent 

VPMADDWD __m512i _mm512_madd_epi16( __m512i a, __m512i b); 
VPMADDWD __m512i _mm512_mask_madd_epi16( __m512i s, __mmask16 k, __m512i a, __m512i b); 
VPMADDWD __m512i _mm512_maskz_madd_epi16( __mmask16 k, __m512i a, __m512i b); 
VPMADDWD __m256i _mm256_mask_madd_epi16( __m256i s, __mmask8 k, __m256i a, __m256i b); 
VPMADDWD __m256i _mm256_maskz_madd_epi16( __mmask8 k, __m256i a, __m256i b); 
VPMADDWD __m128i _mm_mask_madd_epi16( __m128i s, __mmask8 k, __m128i a, __m128i b); 
VPMADDWD __m128i _mm_maskz_madd_epi16( __mmask8 k, __m128i a, __m128i b); 
PMADDWD: __m64 _mm_madd_pi16( __m64 m1, __m64 m2) 
(V)PMADDWD: __m128i _mm_madd_epi16 ( __m128i a, __m128i b) 
VPMADDWD: __m256i _mm256_madd_epi16 ( __m256i a, __m256i b) 

Flags Affected 

None. 

Numeric Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded instruction, see Exceptions Type E4NF.nb. 

PMAXSB/PMAXSW/PMAXSD/PMAXSQ-Maximum of Packed Signed Integers 


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description 
0F EE /r¹ PMAXSW mm1, mm2/m64 | RM | V/V | SSE | Compare signed word integers in mm2/m64 and mm1 and return maximum values. 
66 0F 38 3C /r PMAXSB xmm1, xmm2/m128 | RM | V/V | SSE4_1 | Compare packed signed byte integers in xmm1 and xmm2/m128 and store packed maximum values in xmm1. 
66 0F EE /r PMAXSW xmm1, xmm2/m128 | RM | V/V | SSE2 | Compare packed signed word integers in xmm2/m128 and xmm1 and stores maximum packed values in xmm1. 
66 0F 38 3D /r PMAXSD xmm1, xmm2/m128 | RM | V/V | SSE4_1 | Compare packed signed dword integers in xmm1 and xmm2/m128 and store packed maximum values in xmm1. 
VEX.NDS.128.66.0F38.WIG 3C /r VPMAXSB xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Compare packed signed byte integers in xmm2 and xmm3/m128 and store packed maximum values in xmm1. 
VEX.NDS.128.66.0F.WIG EE /r VPMAXSW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Compare packed signed word integers in xmm3/m128 and xmm2 and store packed maximum values in xmm1. 
VEX.NDS.128.66.0F38.WIG 3D /r VPMAXSD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Compare packed signed dword integers in xmm2 and xmm3/m128 and store packed maximum values in xmm1. 
VEX.NDS.256.66.0F38.WIG 3C /r VPMAXSB ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Compare packed signed byte integers in ymm2 and ymm3/m256 and store packed maximum values in ymm1. 
VEX.NDS.256.66.0F.WIG EE /r VPMAXSW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Compare packed signed word integers in ymm3/m256 and ymm2 and store packed maximum values in ymm1. 
VEX.NDS.256.66.0F38.WIG 3D /r VPMAXSD ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Compare packed signed dword integers in ymm2 and ymm3/m256 and store packed maximum values in ymm1. 
EVEX.NDS.128.66.0F38.WIG 3C /r VPMAXSB xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Compare packed signed byte integers in xmm2 and xmm3/m128 and store packed maximum values in xmm1 under writemask k1. 
EVEX.NDS.256.66.0F38.WIG 3C /r VPMAXSB ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Compare packed signed byte integers in ymm2 and ymm3/m256 and store packed maximum values in ymm1 under writemask k1. 
EVEX.NDS.512.66.0F38.WIG 3C /r VPMAXSB zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Compare packed signed byte integers in zmm2 and zmm3/m512 and store packed maximum values in zmm1 under writemask k1. 
EVEX.NDS.128.66.0F.WIG EE /r VPMAXSW xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Compare packed signed word integers in xmm2 and xmm3/m128 and store packed maximum values in xmm1 under writemask k1. 
EVEX.NDS.256.66.0F.WIG EE /r VPMAXSW ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Compare packed signed word integers in ymm2 and ymm3/m256 and store packed maximum values in ymm1 under writemask k1. 
EVEX.NDS.512.66.0F.WIG EE /r VPMAXSW zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Compare packed signed word integers in zmm2 and zmm3/m512 and store packed maximum values in zmm1 under writemask k1. 
EVEX.NDS.128.66.0F38.W0 3D /r VPMAXSD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Compare packed signed dword integers in xmm2 and xmm3/m128/m32bcst and store packed maximum values in xmm1 using writemask k1. 
EVEX.NDS.256.66.0F38.W0 3D /r VPMAXSD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Compare packed signed dword integers in ymm2 and ymm3/m256/m32bcst and store packed maximum values in ymm1 using writemask k1. 
EVEX.NDS.512.66.0F38.W0 3D /r VPMAXSD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | FV | V/V | AVX512F | Compare packed signed dword integers in zmm2 and zmm3/m512/m32bcst and store packed maximum values in zmm1 using writemask k1. 
EVEX.NDS.128.66.0F38.W1 3D /r VPMAXSQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Compare packed signed qword integers in xmm2 and xmm3/m128/m64bcst and store packed maximum values in xmm1 using writemask k1. 
EVEX.NDS.256.66.0F38.W1 3D /r VPMAXSQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Compare packed signed qword integers in ymm2 and ymm3/m256/m64bcst and store packed maximum values in ymm1 using writemask k1. 
EVEX.NDS.512.66.0F38.W1 3D /r VPMAXSQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | FV | V/V | AVX512F | Compare packed signed qword integers in zmm2 and zmm3/m512/m64bcst and store packed maximum values in zmm1 using writemask k1. 


Instruction Operand Encoding 

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA 
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA 
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA 
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA 


Description 

Performs a SIMD compare of the packed signed byte, word, dword or qword integers in the second source operand 
and the first source operand and returns the maximum value for each pair of integers to the destination operand. 

Legacy SSE version PMAXSW: The source operand can be an MMX technology register or a 64-bit memory location. 
The destination operand can be an MMX technology register. 

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source 
operand is an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding YMM 
destination register remain unchanged. 

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source 
operand is an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding destination 
register are zeroed. 

VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The 
first source and destination operands are YMM registers. Bits (MAX_VL-1:256) of the corresponding destination 
register are zeroed. 

EVEX encoded VPMAXSD/Q: The first source operand is a ZMM/YMM/XMM register; the second source operand is 
a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 
32/64-bit memory location. The destination operand is conditionally updated based on writemask k1. 

EVEX encoded VPMAXSB/W: The first source operand is a ZMM/YMM/XMM register; the second source operand is 
a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is conditionally updated 
based on writemask k1. 

Operation 

PMAXSW (64-bit operands) 

IF DEST[15:0] > SRC[15:0] THEN 
    DEST[15:0] ← DEST[15:0]; 
ELSE 
    DEST[15:0] ← SRC[15:0]; FI; 
(* Repeat operation for 2nd and 3rd words in source and destination operands *) 
IF DEST[63:48] > SRC[63:48] THEN 
    DEST[63:48] ← DEST[63:48]; 
ELSE 
    DEST[63:48] ← SRC[63:48]; FI; 

PMAXSB (128-bit Legacy SSE version) 

IF DEST[7:0] > SRC[7:0] THEN 
    DEST[7:0] ← DEST[7:0]; 
ELSE 
    DEST[7:0] ← SRC[7:0]; FI; 
(* Repeat operation for 2nd through 15th bytes in source and destination operands *) 
IF DEST[127:120] > SRC[127:120] THEN 
    DEST[127:120] ← DEST[127:120]; 
ELSE 
    DEST[127:120] ← SRC[127:120]; FI; 
DEST[MAX_VL-1:128] (Unmodified) 

VPMAXSB (VEX.128 encoded version) 

IF SRC1[7:0] > SRC2[7:0] THEN 
    DEST[7:0] ← SRC1[7:0]; 
ELSE 
    DEST[7:0] ← SRC2[7:0]; FI; 
(* Repeat operation for 2nd through 15th bytes in source and destination operands *) 
IF SRC1[127:120] > SRC2[127:120] THEN 
    DEST[127:120] ← SRC1[127:120]; 
ELSE 
    DEST[127:120] ← SRC2[127:120]; FI; 
DEST[MAX_VL-1:128] ← 0 

VPMAXSB (VEX.256 encoded version) 

IF SRC1[7:0] > SRC2[7:0] THEN 
    DEST[7:0] ← SRC1[7:0]; 
ELSE 
    DEST[7:0] ← SRC2[7:0]; FI; 
(* Repeat operation for 2nd through 31st bytes in source and destination operands *) 
IF SRC1[255:248] > SRC2[255:248] THEN 
    DEST[255:248] ← SRC1[255:248]; 
ELSE 
    DEST[255:248] ← SRC2[255:248]; FI; 
DEST[MAX_VL-1:256] ← 0 

VPMAXSB (EVEX encoded versions) 

(KL, VL) = (16, 128), (32, 256), (64, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 8 
    IF k1[j] OR *no writemask* THEN 
        IF SRC1[i+7:i] > SRC2[i+7:i] 
            THEN DEST[i+7:i] ← SRC1[i+7:i]; 
            ELSE DEST[i+7:i] ← SRC2[i+7:i]; 
        FI; 
    ELSE 
        IF *merging-masking* ; merging-masking 
            THEN *DEST[i+7:i] remains unchanged* 
            ELSE ; zeroing-masking 
                DEST[i+7:i] ← 0 
        FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 
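
The write-masking behavior in the EVEX pseudocode above can be illustrated with a scalar C model. This is a sketch only: the function name vpmaxsb_masked_model, the zeroing flag and the use of a uint64_t for the writemask are assumptions made for illustration, and the zeroing of bits above VL is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of EVEX-encoded VPMAXSB over kl byte elements.
   k:       writemask, bit j controls element j
   zeroing: nonzero selects zeroing-masking, zero selects merging-masking */
void vpmaxsb_masked_model(int8_t *dst, const int8_t *src1, const int8_t *src2,
                          uint64_t k, int zeroing, int kl)
{
    assert(kl >= 0 && kl <= 64);
    for (int j = 0; j < kl; j++) {
        if ((k >> j) & 1u) {
            /* Mask bit set: write the signed maximum of the pair. */
            dst[j] = (src1[j] > src2[j]) ? src1[j] : src2[j];
        } else if (zeroing) {
            /* Mask bit clear, zeroing-masking: element is cleared. */
            dst[j] = 0;
        }
        /* Mask bit clear, merging-masking: dst[j] is left unchanged. */
    }
}
```

Calling the model with kl of 16, 32 or 64 corresponds to the 128-bit, 256-bit and 512-bit vector lengths listed in the (KL, VL) pairs.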


PMAXSW (128-bit Legacy SSE version) 

IF DEST[15:0] > SRC[15:0] THEN 
    DEST[15:0] ← DEST[15:0]; 
ELSE 
    DEST[15:0] ← SRC[15:0]; FI; 
(* Repeat operation for 2nd through 7th words in source and destination operands *) 
IF DEST[127:112] > SRC[127:112] THEN 
    DEST[127:112] ← DEST[127:112]; 
ELSE 
    DEST[127:112] ← SRC[127:112]; FI; 
DEST[MAX_VL-1:128] (Unmodified) 

VPMAXSW (VEX.128 encoded version) 

IF SRC1[15:0] > SRC2[15:0] THEN 
    DEST[15:0] ← SRC1[15:0]; 
ELSE 
    DEST[15:0] ← SRC2[15:0]; FI; 
(* Repeat operation for 2nd through 7th words in source and destination operands *) 
IF SRC1[127:112] > SRC2[127:112] THEN 
    DEST[127:112] ← SRC1[127:112]; 
ELSE 
    DEST[127:112] ← SRC2[127:112]; FI; 
DEST[MAX_VL-1:128] ← 0 

VPMAXSW (VEX.256 encoded version) 

IF SRC1[15:0] > SRC2[15:0] THEN 
    DEST[15:0] ← SRC1[15:0]; 
ELSE 
    DEST[15:0] ← SRC2[15:0]; FI; 
(* Repeat operation for 2nd through 15th words in source and destination operands *) 
IF SRC1[255:240] > SRC2[255:240] THEN 
    DEST[255:240] ← SRC1[255:240]; 
ELSE 
    DEST[255:240] ← SRC2[255:240]; FI; 
DEST[MAX_VL-1:256] ← 0 

VPMAXSW (EVEX encoded versions) 

(KL, VL) = (8, 128), (16, 256), (32, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 16 
    IF k1[j] OR *no writemask* THEN 
        IF SRC1[i+15:i] > SRC2[i+15:i] 
            THEN DEST[i+15:i] ← SRC1[i+15:i]; 
            ELSE DEST[i+15:i] ← SRC2[i+15:i]; 
        FI; 
    ELSE 
        IF *merging-masking* ; merging-masking 
            THEN *DEST[i+15:i] remains unchanged* 
            ELSE ; zeroing-masking 
                DEST[i+15:i] ← 0 
        FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 

PMAXSD (128-bit Legacy SSE version) 

IF DEST[31:0] > SRC[31:0] THEN 
    DEST[31:0] ← DEST[31:0]; 
ELSE 
    DEST[31:0] ← SRC[31:0]; FI; 
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *) 
IF DEST[127:96] > SRC[127:96] THEN 
    DEST[127:96] ← DEST[127:96]; 
ELSE 
    DEST[127:96] ← SRC[127:96]; FI; 
DEST[MAX_VL-1:128] (Unmodified) 

VPMAXSD (VEX.128 encoded version) 

IF SRC1[31:0] > SRC2[31:0] THEN 
    DEST[31:0] ← SRC1[31:0]; 
ELSE 
    DEST[31:0] ← SRC2[31:0]; FI; 
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *) 
IF SRC1[127:96] > SRC2[127:96] THEN 
    DEST[127:96] ← SRC1[127:96]; 
ELSE 
    DEST[127:96] ← SRC2[127:96]; FI; 
DEST[MAX_VL-1:128] ← 0 

VPMAXSD (VEX.256 encoded version) 

IF SRC1[31:0] > SRC2[31:0] THEN 
    DEST[31:0] ← SRC1[31:0]; 
ELSE 
    DEST[31:0] ← SRC2[31:0]; FI; 
(* Repeat operation for 2nd through 7th dwords in source and destination operands *) 
IF SRC1[255:224] > SRC2[255:224] THEN 
    DEST[255:224] ← SRC1[255:224]; 
ELSE 
    DEST[255:224] ← SRC2[255:224]; FI; 
DEST[MAX_VL-1:256] ← 0 


VPMAXSD (EVEX encoded versions) 

(KL, VL) = (4, 128), (8, 256), (16, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 32 
    IF k1[j] OR *no writemask* THEN 
        IF (EVEX.b = 1) AND (SRC2 *is memory*) 
            THEN 
                IF SRC1[i+31:i] > SRC2[31:0] 
                    THEN DEST[i+31:i] ← SRC1[i+31:i]; 
                    ELSE DEST[i+31:i] ← SRC2[31:0]; 
                FI; 
            ELSE 
                IF SRC1[i+31:i] > SRC2[i+31:i] 
                    THEN DEST[i+31:i] ← SRC1[i+31:i]; 
                    ELSE DEST[i+31:i] ← SRC2[i+31:i]; 
                FI; 
        FI; 
    ELSE 
        IF *merging-masking* ; merging-masking 
            THEN *DEST[i+31:i] remains unchanged* 
            ELSE DEST[i+31:i] ← 0 ; zeroing-masking 
        FI 
    FI; 
ENDFOR 
DEST[MAX_VL-1:VL] ← 0 
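
The embedded-broadcast branch (EVEX.b = 1 with a memory source) compares every element of the first source against the single low dword of the second source. A minimal scalar sketch of that difference follows, with masking left out for brevity. The function name vpmaxsd_bcst_model and its broadcast flag are hypothetical and stand in for the encoding bit.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of EVEX-encoded VPMAXSD without masking, kl dword elements.
   broadcast: nonzero models EVEX.b = 1 with a memory source, where
   element 0 of src2 is compared against every element of src1. */
void vpmaxsd_bcst_model(int32_t *dst, const int32_t *src1, const int32_t *src2,
                        int broadcast, int kl)
{
    assert(kl >= 0);
    for (int j = 0; j < kl; j++) {
        /* Pick the broadcast scalar or the matching element of src2. */
        int32_t rhs = broadcast ? src2[0] : src2[j];
        dst[j] = (src1[j] > rhs) ? src1[j] : rhs;
    }
}
```

In the broadcast case only src2[0] is read, which mirrors the m32bcst operand form reading a single 32-bit memory location.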

VPMAXSQ (EVEX encoded versions) 

(KL, VL) = (2, 128), (4, 256), (8, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 64 
    IF k1[j] OR *no writemask* THEN 
        IF (EVEX.b = 1) AND (SRC2 *is memory*) 
            THEN 
                IF SRC1[i+63:i] > SRC2[63:0] 
                    THEN DEST[i+63:i] ← SRC1[i+63:i]; 
                    ELSE DEST[i+63:i] ← SRC2[63:0]; 
                FI; 
            ELSE 
                IF SRC1[i+63:i] > SRC2[i+63:i] 
                    THEN DEST[i+63:i] ← SRC1[i+63:i]; 
                    ELSE DEST[i+63:i] ← SRC2[i+63:i]; 
                FI; 
        FI; 
    ELSE 
        IF *merging-masking* ; merging-masking 
            THEN *DEST[i+63:i] remains unchanged* 
            ELSE ; zeroing-masking 
                DEST[i+63:i] ← 0 
        FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 

Intel C/C++ Compiler Intrinsic Equivalent 

VPMAXSB __m512i _mm512_max_epi8( __m512i a, __m512i b); 
VPMAXSB __m512i _mm512_mask_max_epi8( __m512i s, __mmask64 k, __m512i a, __m512i b); 
VPMAXSB __m512i _mm512_maskz_max_epi8( __mmask64 k, __m512i a, __m512i b); 
VPMAXSW __m512i _mm512_max_epi16( __m512i a, __m512i b); 
VPMAXSW __m512i _mm512_mask_max_epi16( __m512i s, __mmask32 k, __m512i a, __m512i b); 
VPMAXSW __m512i _mm512_maskz_max_epi16( __mmask32 k, __m512i a, __m512i b); 
VPMAXSB __m256i _mm256_mask_max_epi8( __m256i s, __mmask32 k, __m256i a, __m256i b); 
VPMAXSB __m256i _mm256_maskz_max_epi8( __mmask32 k, __m256i a, __m256i b); 
VPMAXSW __m256i _mm256_mask_max_epi16( __m256i s, __mmask16 k, __m256i a, __m256i b); 
VPMAXSW __m256i _mm256_maskz_max_epi16( __mmask16 k, __m256i a, __m256i b); 
VPMAXSB __m128i _mm_mask_max_epi8( __m128i s, __mmask16 k, __m128i a, __m128i b); 
VPMAXSB __m128i _mm_maskz_max_epi8( __mmask16 k, __m128i a, __m128i b); 
VPMAXSW __m128i _mm_mask_max_epi16( __m128i s, __mmask8 k, __m128i a, __m128i b); 
VPMAXSW __m128i _mm_maskz_max_epi16( __mmask8 k, __m128i a, __m128i b); 
VPMAXSD __m256i _mm256_mask_max_epi32( __m256i s, __mmask8 k, __m256i a, __m256i b); 
VPMAXSD __m256i _mm256_maskz_max_epi32( __mmask8 k, __m256i a, __m256i b); 
VPMAXSQ __m256i _mm256_mask_max_epi64( __m256i s, __mmask8 k, __m256i a, __m256i b); 
VPMAXSQ __m256i _mm256_maskz_max_epi64( __mmask8 k, __m256i a, __m256i b); 
VPMAXSD __m128i _mm_mask_max_epi32( __m128i s, __mmask8 k, __m128i a, __m128i b); 
VPMAXSD __m128i _mm_maskz_max_epi32( __mmask8 k, __m128i a, __m128i b); 
VPMAXSQ __m128i _mm_mask_max_epi64( __m128i s, __mmask8 k, __m128i a, __m128i b); 
VPMAXSQ __m128i _mm_maskz_max_epi64( __mmask8 k, __m128i a, __m128i b); 
VPMAXSD __m512i _mm512_max_epi32( __m512i a, __m512i b); 
VPMAXSD __m512i _mm512_mask_max_epi32( __m512i s, __mmask16 k, __m512i a, __m512i b); 
VPMAXSD __m512i _mm512_maskz_max_epi32( __mmask16 k, __m512i a, __m512i b); 
VPMAXSQ __m512i _mm512_max_epi64( __m512i a, __m512i b); 
VPMAXSQ __m512i _mm512_mask_max_epi64( __m512i s, __mmask8 k, __m512i a, __m512i b); 
VPMAXSQ __m512i _mm512_maskz_max_epi64( __mmask8 k, __m512i a, __m512i b); 
(V)PMAXSB __m128i _mm_max_epi8 ( __m128i a, __m128i b); 
(V)PMAXSW __m128i _mm_max_epi16 ( __m128i a, __m128i b) 
(V)PMAXSD __m128i _mm_max_epi32 ( __m128i a, __m128i b); 
VPMAXSB __m256i _mm256_max_epi8 ( __m256i a, __m256i b); 
VPMAXSW __m256i _mm256_max_epi16 ( __m256i a, __m256i b) 
VPMAXSD __m256i _mm256_max_epi32 ( __m256i a, __m256i b); 
PMAXSW: __m64 _mm_max_pi16( __m64 a, __m64 b) 

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded VPMAXSD/Q, see Exceptions Type E4. 

EVEX-encoded VPMAXSB/W, see Exceptions Type E4.nb. 

PMAXUB/PMAXUW—Maximum of Packed Unsigned Integers 


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description 
0F DE /r¹ PMAXUB mm1, mm2/m64 | RM | V/V | SSE | Compare unsigned byte integers in mm2/m64 and mm1 and returns maximum values. 
66 0F DE /r PMAXUB xmm1, xmm2/m128 | RM | V/V | SSE2 | Compare packed unsigned byte integers in xmm1 and xmm2/m128 and store packed maximum values in xmm1. 
66 0F 38 3E /r PMAXUW xmm1, xmm2/m128 | RM | V/V | SSE4_1 | Compare packed unsigned word integers in xmm2/m128 and xmm1 and stores maximum packed values in xmm1. 
VEX.NDS.128.66.0F DE /r VPMAXUB xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Compare packed unsigned byte integers in xmm2 and xmm3/m128 and store packed maximum values in xmm1. 
VEX.NDS.128.66.0F38 3E /r VPMAXUW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Compare packed unsigned word integers in xmm3/m128 and xmm2 and store maximum packed values in xmm1. 
VEX.NDS.256.66.0F DE /r VPMAXUB ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Compare packed unsigned byte integers in ymm2 and ymm3/m256 and store packed maximum values in ymm1. 
VEX.NDS.256.66.0F38 3E /r VPMAXUW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Compare packed unsigned word integers in ymm3/m256 and ymm2 and store maximum packed values in ymm1. 
EVEX.NDS.128.66.0F.WIG DE /r VPMAXUB xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Compare packed unsigned byte integers in xmm2 and xmm3/m128 and store packed maximum values in xmm1 under writemask k1. 
EVEX.NDS.256.66.0F.WIG DE /r VPMAXUB ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Compare packed unsigned byte integers in ymm2 and ymm3/m256 and store packed maximum values in ymm1 under writemask k1. 
EVEX.NDS.512.66.0F.WIG DE /r VPMAXUB zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Compare packed unsigned byte integers in zmm2 and zmm3/m512 and store packed maximum values in zmm1 under writemask k1. 
EVEX.NDS.128.66.0F38.WIG 3E /r VPMAXUW xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Compare packed unsigned word integers in xmm2 and xmm3/m128 and store packed maximum values in xmm1 under writemask k1. 
EVEX.NDS.256.66.0F38.WIG 3E /r VPMAXUW ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Compare packed unsigned word integers in ymm2 and ymm3/m256 and store packed maximum values in ymm1 under writemask k1. 
EVEX.NDS.512.66.0F38.WIG 3E /r VPMAXUW zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Compare packed unsigned word integers in zmm2 and zmm3/m512 and store packed maximum values in zmm1 under writemask k1. 

NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX 
Registers" in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. 


Instruction Operand Encoding 

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA 
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA 
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA 

Description 

Performs a SIMD compare of the packed unsigned byte or word integers in the second source operand and the first 
source operand and returns the maximum value for each pair of integers to the destination operand. 

Legacy SSE version PMAXUB: The source operand can be an MMX technology register or a 64-bit memory location. 
The destination operand can be an MMX technology register. 

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source 
operand is an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding destination 
register remain unchanged. 

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source 
operand is an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding destination 
register are zeroed. 

VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The 
first source and destination operands are YMM registers. 

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register; the second source operand is a 
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is conditionally updated 
based on writemask k1. 
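
The only difference between the unsigned maximum described here and the signed maximum of PMAXSB/PMAXSW is how each bit pattern is ordered. The following scalar sketch illustrates this for a single byte element. The function names max_u8, max_s8 and max_signedness_check are hypothetical and used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Maximum of two bytes interpreted as unsigned (one PMAXUB element). */
uint8_t max_u8(uint8_t a, uint8_t b) { return a > b ? a : b; }

/* Maximum of two bytes interpreted as signed (one PMAXSB element). */
int8_t max_s8(int8_t a, int8_t b) { return a > b ? a : b; }

/* The bit patterns 80H and 7FH order differently under the two views. */
void max_signedness_check(void)
{
    assert(max_u8(0x80, 0x7F) == 0x80); /* unsigned: 128 > 127 */
    assert(max_s8(-128, 127) == 127);   /* signed: 80H is -128, so 7FH wins */
}
```

Choosing between the signed and unsigned instruction forms therefore matters whenever an element can have its most significant bit set.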

Operation 

PMAXUB (64-bit operands) 

IF DEST[7:0] > SRC[7:0] THEN 
    DEST[7:0] ← DEST[7:0]; 
ELSE 
    DEST[7:0] ← SRC[7:0]; FI; 
(* Repeat operation for 2nd through 7th bytes in source and destination operands *) 
IF DEST[63:56] > SRC[63:56] THEN 
    DEST[63:56] ← DEST[63:56]; 
ELSE 
    DEST[63:56] ← SRC[63:56]; FI; 

PMAXUB (128-bit Legacy SSE version) 

IF DEST[7:0] > SRC[7:0] THEN 
    DEST[7:0] ← DEST[7:0]; 
ELSE 
    DEST[7:0] ← SRC[7:0]; FI; 
(* Repeat operation for 2nd through 15th bytes in source and destination operands *) 
IF DEST[127:120] > SRC[127:120] THEN 
    DEST[127:120] ← DEST[127:120]; 
ELSE 
    DEST[127:120] ← SRC[127:120]; FI; 
DEST[MAX_VL-1:128] (Unmodified) 

VPMAXUB (VEX.128 encoded version) 

IF SRC1[7:0] > SRC2[7:0] THEN 
    DEST[7:0] ← SRC1[7:0]; 
ELSE 
    DEST[7:0] ← SRC2[7:0]; FI; 
(* Repeat operation for 2nd through 15th bytes in source and destination operands *) 
IF SRC1[127:120] > SRC2[127:120] THEN 
    DEST[127:120] ← SRC1[127:120]; 
ELSE 
    DEST[127:120] ← SRC2[127:120]; FI; 
DEST[MAX_VL-1:128] ← 0 

VPMAXUB (VEX.256 encoded version) 

IF SRC1[7:0] > SRC2[7:0] THEN 
    DEST[7:0] ← SRC1[7:0]; 
ELSE 
    DEST[7:0] ← SRC2[7:0]; FI; 
(* Repeat operation for 2nd through 31st bytes in source and destination operands *) 
IF SRC1[255:248] > SRC2[255:248] THEN 
    DEST[255:248] ← SRC1[255:248]; 
ELSE 
    DEST[255:248] ← SRC2[255:248]; FI; 
DEST[MAX_VL-1:256] ← 0 

VPMAXUB (EVEX encoded versions) 

(KL, VL) = (16, 128), (32, 256), (64, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 8 
    IF k1[j] OR *no writemask* THEN 
        IF SRC1[i+7:i] > SRC2[i+7:i] 
            THEN DEST[i+7:i] ← SRC1[i+7:i]; 
            ELSE DEST[i+7:i] ← SRC2[i+7:i]; 
        FI; 
    ELSE 
        IF *merging-masking* ; merging-masking 
            THEN *DEST[i+7:i] remains unchanged* 
            ELSE ; zeroing-masking 
                DEST[i+7:i] ← 0 
        FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 

PMAXUW (128-bit Legacy SSE version)

IF DEST[15:0] > SRC[15:0] THEN
    DEST[15:0] ← DEST[15:0];
ELSE
    DEST[15:0] ← SRC[15:0]; FI;
(* Repeat operation for 2nd through 7th words in source and destination operands *)
IF DEST[127:112] > SRC[127:112] THEN
    DEST[127:112] ← DEST[127:112];
ELSE
    DEST[127:112] ← SRC[127:112]; FI;
DEST[MAX_VL-1:128] (Unmodified)

VPMAXUW (VEX.128 encoded version)

IF SRC1[15:0] > SRC2[15:0] THEN
    DEST[15:0] ← SRC1[15:0];
ELSE
    DEST[15:0] ← SRC2[15:0]; FI;
(* Repeat operation for 2nd through 7th words in source and destination operands *)
IF SRC1[127:112] > SRC2[127:112] THEN
    DEST[127:112] ← SRC1[127:112];
ELSE
    DEST[127:112] ← SRC2[127:112]; FI;
DEST[MAX_VL-1:128] ← 0

VPMAXUW (VEX.256 encoded version)

IF SRC1[15:0] > SRC2[15:0] THEN
    DEST[15:0] ← SRC1[15:0];
ELSE
    DEST[15:0] ← SRC2[15:0]; FI;
(* Repeat operation for 2nd through 15th words in source and destination operands *)
IF SRC1[255:240] > SRC2[255:240] THEN
    DEST[255:240] ← SRC1[255:240];
ELSE
    DEST[255:240] ← SRC2[255:240]; FI;
DEST[MAX_VL-1:256] ← 0

VPMAXUW (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask* THEN
        IF SRC1[i+15:i] > SRC2[i+15:i]
            THEN DEST[i+15:i] ← SRC1[i+15:i];
            ELSE DEST[i+15:i] ← SRC2[i+15:i];
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+15:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+15:i] ← 0
        FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VPMAXUB __m512i _mm512_max_epu8(__m512i a, __m512i b);
VPMAXUB __m512i _mm512_mask_max_epu8(__m512i s, __mmask64 k, __m512i a, __m512i b);
VPMAXUB __m512i _mm512_maskz_max_epu8(__mmask64 k, __m512i a, __m512i b);
VPMAXUW __m512i _mm512_max_epu16(__m512i a, __m512i b);
VPMAXUW __m512i _mm512_mask_max_epu16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPMAXUW __m512i _mm512_maskz_max_epu16(__mmask32 k, __m512i a, __m512i b);
VPMAXUB __m256i _mm256_mask_max_epu8(__m256i s, __mmask32 k, __m256i a, __m256i b);
VPMAXUB __m256i _mm256_maskz_max_epu8(__mmask32 k, __m256i a, __m256i b);
VPMAXUW __m256i _mm256_mask_max_epu16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPMAXUW __m256i _mm256_maskz_max_epu16(__mmask16 k, __m256i a, __m256i b);
VPMAXUB __m128i _mm_mask_max_epu8(__m128i s, __mmask16 k, __m128i a, __m128i b);
VPMAXUB __m128i _mm_maskz_max_epu8(__mmask16 k, __m128i a, __m128i b);
VPMAXUW __m128i _mm_mask_max_epu16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMAXUW __m128i _mm_maskz_max_epu16(__mmask8 k, __m128i a, __m128i b);
(V)PMAXUB __m128i _mm_max_epu8(__m128i a, __m128i b);
(V)PMAXUW __m128i _mm_max_epu16(__m128i a, __m128i b);
VPMAXUB __m256i _mm256_max_epu8(__m256i a, __m256i b);
VPMAXUW __m256i _mm256_max_epu16(__m256i a, __m256i b);
PMAXUB: __m64 _mm_max_pu8(__m64 a, __m64 b);

SIMD Floating-Point Exceptions

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 
EVEX-encoded instruction, see Exceptions Type E4.nb. 

PMAXUD/PMAXUQ—Maximum of Packed Unsigned Integers

| Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description |
|---|---|---|---|---|
| 66 0F 38 3F /r PMAXUD xmm1, xmm2/m128 | RM | V/V | SSE4_1 | Compare packed unsigned dword integers in xmm1 and xmm2/m128 and store packed maximum values in xmm1. |
| VEX.NDS.128.66.0F38.WIG 3F /r VPMAXUD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Compare packed unsigned dword integers in xmm2 and xmm3/m128 and store packed maximum values in xmm1. |
| VEX.NDS.256.66.0F38.WIG 3F /r VPMAXUD ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Compare packed unsigned dword integers in ymm2 and ymm3/m256 and store packed maximum values in ymm1. |
| EVEX.NDS.128.66.0F38.W0 3F /r VPMAXUD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Compare packed unsigned dword integers in xmm2 and xmm3/m128/m32bcst and store packed maximum values in xmm1 under writemask k1. |
| EVEX.NDS.256.66.0F38.W0 3F /r VPMAXUD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Compare packed unsigned dword integers in ymm2 and ymm3/m256/m32bcst and store packed maximum values in ymm1 under writemask k1. |
| EVEX.NDS.512.66.0F38.W0 3F /r VPMAXUD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | FV | V/V | AVX512F | Compare packed unsigned dword integers in zmm2 and zmm3/m512/m32bcst and store packed maximum values in zmm1 under writemask k1. |
| EVEX.NDS.128.66.0F38.W1 3F /r VPMAXUQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Compare packed unsigned qword integers in xmm2 and xmm3/m128/m64bcst and store packed maximum values in xmm1 under writemask k1. |
| EVEX.NDS.256.66.0F38.W1 3F /r VPMAXUQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Compare packed unsigned qword integers in ymm2 and ymm3/m256/m64bcst and store packed maximum values in ymm1 under writemask k1. |
| EVEX.NDS.512.66.0F38.W1 3F /r VPMAXUQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | FV | V/V | AVX512F | Compare packed unsigned qword integers in zmm2 and zmm3/m512/m64bcst and store packed maximum values in zmm1 under writemask k1. |


Instruction Operand Encoding 


| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA |
| RVM | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA |
| FV | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA |


Description 

Performs a SIMD compare of the packed unsigned dword or qword integers in the second source operand and the 
first source operand and returns the maximum value for each pair of integers to the destination operand. 

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source 
operand is an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding destination 
register remain unchanged. 

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding destination
register are zeroed.

VEX.256 encoded version: The first source operand is a YMM register; the second source operand is a YMM register
or a 256-bit memory location. Bits (MAX_VL-1:256) of the corresponding destination register are zeroed.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register; the second source operand is a
ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
32/64-bit memory location. The destination operand is conditionally updated based on writemask k1.
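
The difference between the legacy SSE and VEX.128 treatment of the upper destination bits can be illustrated with a small scalar model. This is a sketch under stated assumptions: MAX_VL is taken to be 512 bits, and the vreg_model type and both function names are hypothetical names introduced only for this example.

```c
#include <stdint.h>
#include <string.h>

#define MAX_VL_DWORDS 16   /* assumption: MAX_VL = 512 bits = 16 dwords */

/* Hypothetical register model: d[0] holds bits 31:0, d[15] bits 511:480. */
typedef struct { uint32_t d[MAX_VL_DWORDS]; } vreg_model;

static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* Legacy SSE PMAXUD: dest is also the first source; bits above 127 are untouched. */
static void pmaxud_sse_model(vreg_model *dest, const vreg_model *src)
{
    for (int j = 0; j < 4; j++)
        dest->d[j] = max_u32(dest->d[j], src->d[j]);
}

/* VEX.128 VPMAXUD: separate sources; bits MAX_VL-1:128 of dest are zeroed. */
static void vpmaxud_vex128_model(vreg_model *dest, const vreg_model *src1,
                                 const vreg_model *src2)
{
    vreg_model r;
    memset(&r, 0, sizeof r);            /* upper bits start as zero */
    for (int j = 0; j < 4; j++)
        r.d[j] = max_u32(src1->d[j], src2->d[j]);
    *dest = r;                          /* temporary allows dest to alias a source */
}
```

The model also shows why the comparison is unsigned: a dword such as 0xFFFFFFF0 wins against any small positive value, whereas a signed compare would treat it as negative.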

Operation

PMAXUD (128-bit Legacy SSE version)

IF DEST[31:0] > SRC[31:0] THEN
    DEST[31:0] ← DEST[31:0];
ELSE
    DEST[31:0] ← SRC[31:0]; FI;
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)
IF DEST[127:96] > SRC[127:96] THEN
    DEST[127:96] ← DEST[127:96];
ELSE
    DEST[127:96] ← SRC[127:96]; FI;
DEST[MAX_VL-1:128] (Unmodified)


VPMAXUD (VEX.128 encoded version)

IF SRC1[31:0] > SRC2[31:0] THEN
    DEST[31:0] ← SRC1[31:0];
ELSE
    DEST[31:0] ← SRC2[31:0]; FI;
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)
IF SRC1[127:96] > SRC2[127:96] THEN
    DEST[127:96] ← SRC1[127:96];
ELSE
    DEST[127:96] ← SRC2[127:96]; FI;
DEST[MAX_VL-1:128] ← 0


VPMAXUD (VEX.256 encoded version)

IF SRC1[31:0] > SRC2[31:0] THEN
    DEST[31:0] ← SRC1[31:0];
ELSE
    DEST[31:0] ← SRC2[31:0]; FI;
(* Repeat operation for 2nd through 7th dwords in source and destination operands *)
IF SRC1[255:224] > SRC2[255:224] THEN
    DEST[255:224] ← SRC1[255:224];
ELSE
    DEST[255:224] ← SRC2[255:224]; FI;
DEST[MAX_VL-1:256] ← 0

VPMAXUD (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*)
            THEN
                IF SRC1[i+31:i] > SRC2[31:0]
                    THEN DEST[i+31:i] ← SRC1[i+31:i];
                    ELSE DEST[i+31:i] ← SRC2[31:0];
                FI;
            ELSE
                IF SRC1[i+31:i] > SRC2[i+31:i]
                    THEN DEST[i+31:i] ← SRC1[i+31:i];
                    ELSE DEST[i+31:i] ← SRC2[i+31:i];
                FI;
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+31:i] ← 0
        FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0
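
The embedded-broadcast branch (EVEX.b = 1 with a memory operand) differs from the full-vector branch only in which source element is read. A minimal scalar sketch of the unmasked case follows; the function name vpmaxud_bcst_model and the bcst flag are assumptions for illustration and do not correspond to any real intrinsic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the unmasked VPMAXUD operand fetch and compare.
 * When bcst is true (EVEX.b = 1, memory operand), only mem[0] is read and
 * reused for every element; otherwise mem[j] is paired with src1[j]. */
static void vpmaxud_bcst_model(uint32_t *dest, const uint32_t *src1,
                               const uint32_t *mem, size_t kl, bool bcst)
{
    assert(kl == 4 || kl == 8 || kl == 16);
    for (size_t j = 0; j < kl; j++) {
        uint32_t s2 = bcst ? mem[0] : mem[j];
        dest[j] = (src1[j] > s2) ? src1[j] : s2;
    }
}
```

A typical use of the broadcast form is clamping every element to a lower bound held in a single dword in memory, which avoids materializing a full vector of identical values.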


VPMAXUQ (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*)
            THEN
                IF SRC1[i+63:i] > SRC2[63:0]
                    THEN DEST[i+63:i] ← SRC1[i+63:i];
                    ELSE DEST[i+63:i] ← SRC2[63:0];
                FI;
            ELSE
                IF SRC1[i+63:i] > SRC2[i+63:i]
                    THEN DEST[i+63:i] ← SRC1[i+63:i];
                    ELSE DEST[i+63:i] ← SRC2[i+63:i];
                FI;
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VPMAXUD __m512i _mm512_max_epu32(__m512i a, __m512i b);
VPMAXUD __m512i _mm512_mask_max_epu32(__m512i s, __mmask16 k, __m512i a, __m512i b);
VPMAXUD __m512i _mm512_maskz_max_epu32(__mmask16 k, __m512i a, __m512i b);
VPMAXUQ __m512i _mm512_max_epu64(__m512i a, __m512i b);
VPMAXUQ __m512i _mm512_mask_max_epu64(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPMAXUQ __m512i _mm512_maskz_max_epu64(__mmask8 k, __m512i a, __m512i b);
VPMAXUD __m256i _mm256_mask_max_epu32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPMAXUD __m256i _mm256_maskz_max_epu32(__mmask8 k, __m256i a, __m256i b);
VPMAXUQ __m256i _mm256_mask_max_epu64(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPMAXUQ __m256i _mm256_maskz_max_epu64(__mmask8 k, __m256i a, __m256i b);
VPMAXUD __m128i _mm_mask_max_epu32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMAXUD __m128i _mm_maskz_max_epu32(__mmask8 k, __m128i a, __m128i b);
VPMAXUQ __m128i _mm_mask_max_epu64(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMAXUQ __m128i _mm_maskz_max_epu64(__mmask8 k, __m128i a, __m128i b);
(V)PMAXUD __m128i _mm_max_epu32(__m128i a, __m128i b);
VPMAXUD __m256i _mm256_max_epu32(__m256i a, __m256i b);


SIMD Floating-Point Exceptions 

None 


Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 
EVEX-encoded instruction, see Exceptions Type E4. 

PMINSB/PMINSW—Minimum of Packed Signed Integers

| Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description |
|---|---|---|---|---|
| 0F EA /r¹ PMINSW mm1, mm2/m64 | RM | V/V | SSE | Compare signed word integers in mm2/m64 and mm1 and return minimum values. |
| 66 0F 38 38 /r PMINSB xmm1, xmm2/m128 | RM | V/V | SSE4_1 | Compare packed signed byte integers in xmm1 and xmm2/m128 and store packed minimum values in xmm1. |
| 66 0F EA /r PMINSW xmm1, xmm2/m128 | RM | V/V | SSE2 | Compare packed signed word integers in xmm2/m128 and xmm1 and store packed minimum values in xmm1. |
| VEX.NDS.128.66.0F38 38 /r VPMINSB xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Compare packed signed byte integers in xmm2 and xmm3/m128 and store packed minimum values in xmm1. |
| VEX.NDS.128.66.0F EA /r VPMINSW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Compare packed signed word integers in xmm3/m128 and xmm2 and return packed minimum values in xmm1. |
| VEX.NDS.256.66.0F38 38 /r VPMINSB ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Compare packed signed byte integers in ymm2 and ymm3/m256 and store packed minimum values in ymm1. |
| VEX.NDS.256.66.0F EA /r VPMINSW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Compare packed signed word integers in ymm3/m256 and ymm2 and return packed minimum values in ymm1. |
| EVEX.NDS.128.66.0F38.WIG 38 /r VPMINSB xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Compare packed signed byte integers in xmm2 and xmm3/m128 and store packed minimum values in xmm1 under writemask k1. |
| EVEX.NDS.256.66.0F38.WIG 38 /r VPMINSB ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Compare packed signed byte integers in ymm2 and ymm3/m256 and store packed minimum values in ymm1 under writemask k1. |
| EVEX.NDS.512.66.0F38.WIG 38 /r VPMINSB zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Compare packed signed byte integers in zmm2 and zmm3/m512 and store packed minimum values in zmm1 under writemask k1. |
| EVEX.NDS.128.66.0F.WIG EA /r VPMINSW xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Compare packed signed word integers in xmm2 and xmm3/m128 and store packed minimum values in xmm1 under writemask k1. |
| EVEX.NDS.256.66.0F.WIG EA /r VPMINSW ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Compare packed signed word integers in ymm2 and ymm3/m256 and store packed minimum values in ymm1 under writemask k1. |
| EVEX.NDS.512.66.0F.WIG EA /r VPMINSW zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Compare packed signed word integers in zmm2 and zmm3/m512 and store packed minimum values in zmm1 under writemask k1. |

NOTES:

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers" in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding 


| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA |
| RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA |
| FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA |

Description

Performs a SIMD compare of the packed signed byte or word integers in the second source operand and
the first source operand and returns the minimum value for each pair of integers to the destination operand.

Legacy SSE version PMINSW: The source operand can be an MMX technology register or a 64-bit memory location.
The destination operand can be an MMX technology register.

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding destination
register remain unchanged.

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding destination
register are zeroed.

VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register; the second source operand is a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is conditionally updated
based on writemask k1.
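
Because the elements are interpreted as signed, a byte with bit pattern FFH counts as -1 and is smaller than 01H, whereas an unsigned minimum would treat it as 255. A minimal scalar sketch of the signed byte case follows; the function name pminsb_model and the int8_t array view of the operands are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of a signed packed-byte minimum over n elements.
 * Declaring the elements as int8_t makes the compare signed. */
static void pminsb_model(int8_t *dest, const int8_t *src1,
                         const int8_t *src2, size_t n)
{
    assert(dest != NULL && src1 != NULL && src2 != NULL);
    for (size_t j = 0; j < n; j++)
        dest[j] = (src1[j] < src2[j]) ? src1[j] : src2[j];
}
```

For n = 16, 32 or 64 this corresponds to the unmasked 128-, 256- and 512-bit forms respectively.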

Operation 

PMINSW (64-bit operands)

IF DEST[15:0] < SRC[15:0] THEN
    DEST[15:0] ← DEST[15:0];
ELSE
    DEST[15:0] ← SRC[15:0]; FI;
(* Repeat operation for 2nd and 3rd words in source and destination operands *)
IF DEST[63:48] < SRC[63:48] THEN
    DEST[63:48] ← DEST[63:48];
ELSE
    DEST[63:48] ← SRC[63:48]; FI;
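
The bit ranges in the 64-bit form (DEST[15:0] through DEST[63:48]) map onto four 16-bit fields of a single 64-bit value. The sketch below models that packing with a plain uint64_t; the function name pminsw_mmx_model is hypothetical, and the signed comparison is done by flipping each word's sign bit before an unsigned compare, which is a substituted portable technique and not something the manual prescribes.

```c
#include <stdint.h>

/* Hypothetical model of PMINSW on 64-bit operands held in uint64_t values. */
static uint64_t pminsw_mmx_model(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int j = 0; j < 4; j++) {
        unsigned shift = 16u * (unsigned)j;
        uint16_t d = (uint16_t)(dest >> shift);   /* DEST[16j+15:16j] */
        uint16_t s = (uint16_t)(src >> shift);    /* SRC[16j+15:16j]  */
        /* XOR with 8000H maps signed order onto unsigned order. */
        uint16_t m = ((uint16_t)(d ^ 0x8000u) < (uint16_t)(s ^ 0x8000u)) ? d : s;
        result |= (uint64_t)m << shift;
    }
    return result;
}
```

Each word is handled independently, so no borrow or carry crosses a 16-bit boundary.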

PMINSB (128-bit Legacy SSE version)

IF DEST[7:0] < SRC[7:0] THEN
    DEST[7:0] ← DEST[7:0];
ELSE
    DEST[7:0] ← SRC[7:0]; FI;
(* Repeat operation for 2nd through 15th bytes in source and destination operands *)
IF DEST[127:120] < SRC[127:120] THEN
    DEST[127:120] ← DEST[127:120];
ELSE
    DEST[127:120] ← SRC[127:120]; FI;
DEST[MAX_VL-1:128] (Unmodified)

VPMINSB (VEX.128 encoded version)

IF SRC1[7:0] < SRC2[7:0] THEN
    DEST[7:0] ← SRC1[7:0];
ELSE
    DEST[7:0] ← SRC2[7:0]; FI;
(* Repeat operation for 2nd through 15th bytes in source and destination operands *)
IF SRC1[127:120] < SRC2[127:120] THEN
    DEST[127:120] ← SRC1[127:120];
ELSE
    DEST[127:120] ← SRC2[127:120]; FI;
DEST[MAX_VL-1:128] ← 0


VPMINSB (VEX.256 encoded version)

IF SRC1[7:0] < SRC2[7:0] THEN
    DEST[7:0] ← SRC1[7:0];
ELSE
    DEST[7:0] ← SRC2[7:0]; FI;
(* Repeat operation for 2nd through 31st bytes in source and destination operands *)
IF SRC1[255:248] < SRC2[255:248] THEN
    DEST[255:248] ← SRC1[255:248];
ELSE
    DEST[255:248] ← SRC2[255:248]; FI;
DEST[MAX_VL-1:256] ← 0

VPMINSB (EVEX encoded versions)

(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask* THEN
        IF SRC1[i+7:i] < SRC2[i+7:i]
            THEN DEST[i+7:i] ← SRC1[i+7:i];
            ELSE DEST[i+7:i] ← SRC2[i+7:i];
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+7:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+7:i] ← 0
        FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0

PMINSW (128-bit Legacy SSE version)

IF DEST[15:0] < SRC[15:0] THEN
    DEST[15:0] ← DEST[15:0];
ELSE
    DEST[15:0] ← SRC[15:0]; FI;
(* Repeat operation for 2nd through 7th words in source and destination operands *)
IF DEST[127:112] < SRC[127:112] THEN
    DEST[127:112] ← DEST[127:112];
ELSE
    DEST[127:112] ← SRC[127:112]; FI;
DEST[MAX_VL-1:128] (Unmodified)

VPMINSW (VEX.128 encoded version)

IF SRC1[15:0] < SRC2[15:0] THEN
    DEST[15:0] ← SRC1[15:0];
ELSE
    DEST[15:0] ← SRC2[15:0]; FI;
(* Repeat operation for 2nd through 7th words in source and destination operands *)
IF SRC1[127:112] < SRC2[127:112] THEN
    DEST[127:112] ← SRC1[127:112];
ELSE
    DEST[127:112] ← SRC2[127:112]; FI;
DEST[MAX_VL-1:128] ← 0

VPMINSW (VEX.256 encoded version)

IF SRC1[15:0] < SRC2[15:0] THEN
    DEST[15:0] ← SRC1[15:0];
ELSE
    DEST[15:0] ← SRC2[15:0]; FI;
(* Repeat operation for 2nd through 15th words in source and destination operands *)
IF SRC1[255:240] < SRC2[255:240] THEN
    DEST[255:240] ← SRC1[255:240];
ELSE
    DEST[255:240] ← SRC2[255:240]; FI;
DEST[MAX_VL-1:256] ← 0

VPMINSW (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask* THEN
        IF SRC1[i+15:i] < SRC2[i+15:i]
            THEN DEST[i+15:i] ← SRC1[i+15:i];
            ELSE DEST[i+15:i] ← SRC2[i+15:i];
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+15:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+15:i] ← 0
        FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VPMINSB __m512i _mm512_min_epi8(__m512i a, __m512i b);
VPMINSB __m512i _mm512_mask_min_epi8(__m512i s, __mmask64 k, __m512i a, __m512i b);
VPMINSB __m512i _mm512_maskz_min_epi8(__mmask64 k, __m512i a, __m512i b);
VPMINSW __m512i _mm512_min_epi16(__m512i a, __m512i b);
VPMINSW __m512i _mm512_mask_min_epi16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPMINSW __m512i _mm512_maskz_min_epi16(__mmask32 k, __m512i a, __m512i b);
VPMINSB __m256i _mm256_mask_min_epi8(__m256i s, __mmask32 k, __m256i a, __m256i b);
VPMINSB __m256i _mm256_maskz_min_epi8(__mmask32 k, __m256i a, __m256i b);
VPMINSW __m256i _mm256_mask_min_epi16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPMINSW __m256i _mm256_maskz_min_epi16(__mmask16 k, __m256i a, __m256i b);
VPMINSB __m128i _mm_mask_min_epi8(__m128i s, __mmask16 k, __m128i a, __m128i b);
VPMINSB __m128i _mm_maskz_min_epi8(__mmask16 k, __m128i a, __m128i b);
VPMINSW __m128i _mm_mask_min_epi16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMINSW __m128i _mm_maskz_min_epi16(__mmask8 k, __m128i a, __m128i b);
(V)PMINSB __m128i _mm_min_epi8(__m128i a, __m128i b);
(V)PMINSW __m128i _mm_min_epi16(__m128i a, __m128i b);
VPMINSB __m256i _mm256_min_epi8(__m256i a, __m256i b);
VPMINSW __m256i _mm256_min_epi16(__m256i a, __m256i b);
PMINSW: __m64 _mm_min_pi16(__m64 a, __m64 b);

SIMD Floating-Point Exceptions

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded instruction, see Exceptions Type E4.nb. 

#MF (64-bit operations only) If there is a pending x87 FPU exception. 

PMINSD/PMINSQ—Minimum of Packed Signed Integers

| Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description |
|---|---|---|---|---|
| 66 0F 38 39 /r PMINSD xmm1, xmm2/m128 | RM | V/V | SSE4_1 | Compare packed signed dword integers in xmm1 and xmm2/m128 and store packed minimum values in xmm1. |
| VEX.NDS.128.66.0F38.WIG 39 /r VPMINSD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Compare packed signed dword integers in xmm2 and xmm3/m128 and store packed minimum values in xmm1. |
| VEX.NDS.256.66.0F38.WIG 39 /r VPMINSD ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Compare packed signed dword integers in ymm2 and ymm3/m256 and store packed minimum values in ymm1. |
| EVEX.NDS.128.66.0F38.W0 39 /r VPMINSD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Compare packed signed dword integers in xmm2 and xmm3/m128 and store packed minimum values in xmm1 under writemask k1. |
| EVEX.NDS.256.66.0F38.W0 39 /r VPMINSD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Compare packed signed dword integers in ymm2 and ymm3/m256 and store packed minimum values in ymm1 under writemask k1. |
| EVEX.NDS.512.66.0F38.W0 39 /r VPMINSD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | FV | V/V | AVX512F | Compare packed signed dword integers in zmm2 and zmm3/m512/m32bcst and store packed minimum values in zmm1 under writemask k1. |
| EVEX.NDS.128.66.0F38.W1 39 /r VPMINSQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Compare packed signed qword integers in xmm2 and xmm3/m128 and store packed minimum values in xmm1 under writemask k1. |
| EVEX.NDS.256.66.0F38.W1 39 /r VPMINSQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Compare packed signed qword integers in ymm2 and ymm3/m256 and store packed minimum values in ymm1 under writemask k1. |
| EVEX.NDS.512.66.0F38.W1 39 /r VPMINSQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | FV | V/V | AVX512F | Compare packed signed qword integers in zmm2 and zmm3/m512/m64bcst and store packed minimum values in zmm1 under writemask k1. |


Instruction Operand Encoding 


| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA |
| RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA |
| FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA |


Description 

Performs a SIMD compare of the packed signed dword or qword integers in the second source operand and the first 
source operand and returns the minimum value for each pair of integers to the destination operand. 

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source 
operand is an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding destination 
register remain unchanged. 

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding destination
register are zeroed.

VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers. Bits (MAX_VL-1:256) of the corresponding destination
register are zeroed.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register; the second source operand is a
ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
32/64-bit memory location. The destination operand is conditionally updated based on writemask k1.

Operation 

PMINSD (128-bit Legacy SSE version)

IF DEST[31:0] < SRC[31:0] THEN
    DEST[31:0] ← DEST[31:0];
ELSE
    DEST[31:0] ← SRC[31:0]; FI;
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)
IF DEST[127:96] < SRC[127:96] THEN
    DEST[127:96] ← DEST[127:96];
ELSE
    DEST[127:96] ← SRC[127:96]; FI;
DEST[MAX_VL-1:128] (Unmodified)

VPMINSD (VEX.128 encoded version)

IF SRC1[31:0] < SRC2[31:0] THEN
    DEST[31:0] ← SRC1[31:0];
ELSE
    DEST[31:0] ← SRC2[31:0]; FI;
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)
IF SRC1[127:96] < SRC2[127:96] THEN
    DEST[127:96] ← SRC1[127:96];
ELSE
    DEST[127:96] ← SRC2[127:96]; FI;
DEST[MAX_VL-1:128] ← 0

VPMINSD (VEX.256 encoded version)

IF SRC1[31:0] < SRC2[31:0] THEN
    DEST[31:0] ← SRC1[31:0];
ELSE
    DEST[31:0] ← SRC2[31:0]; FI;
(* Repeat operation for 2nd through 7th dwords in source and destination operands *)
IF SRC1[255:224] < SRC2[255:224] THEN
    DEST[255:224] ← SRC1[255:224];
ELSE
    DEST[255:224] ← SRC2[255:224]; FI;
DEST[MAX_VL-1:256] ← 0

VPMINSD (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*)
            THEN
                IF SRC1[i+31:i] < SRC2[31:0]
                    THEN DEST[i+31:i] ← SRC1[i+31:i];
                    ELSE DEST[i+31:i] ← SRC2[31:0];
                FI;
            ELSE
                IF SRC1[i+31:i] < SRC2[i+31:i]
                    THEN DEST[i+31:i] ← SRC1[i+31:i];
                    ELSE DEST[i+31:i] ← SRC2[i+31:i];
                FI;
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+31:i] ← 0
        FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0


VPMINSQ (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*)
            THEN
                IF SRC1[i+63:i] < SRC2[63:0]
                    THEN DEST[i+63:i] ← SRC1[i+63:i];
                    ELSE DEST[i+63:i] ← SRC2[63:0];
                FI;
            ELSE
                IF SRC1[i+63:i] < SRC2[i+63:i]
                    THEN DEST[i+63:i] ← SRC1[i+63:i];
                    ELSE DEST[i+63:i] ← SRC2[i+63:i];
                FI;
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VPMINSD __m512i _mm512_min_epi32(__m512i a, __m512i b);
VPMINSD __m512i _mm512_mask_min_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b);
VPMINSD __m512i _mm512_maskz_min_epi32(__mmask16 k, __m512i a, __m512i b);
VPMINSQ __m512i _mm512_min_epi64(__m512i a, __m512i b);
VPMINSQ __m512i _mm512_mask_min_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPMINSQ __m512i _mm512_maskz_min_epi64(__mmask8 k, __m512i a, __m512i b);
VPMINSD __m256i _mm256_mask_min_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPMINSD __m256i _mm256_maskz_min_epi32(__mmask8 k, __m256i a, __m256i b);
VPMINSQ __m256i _mm256_mask_min_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPMINSQ __m256i _mm256_maskz_min_epi64(__mmask8 k, __m256i a, __m256i b);
VPMINSD __m128i _mm_mask_min_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMINSD __m128i _mm_maskz_min_epi32(__mmask8 k, __m128i a, __m128i b);
VPMINSQ __m128i _mm_mask_min_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMINSQ __m128i _mm_maskz_min_epi64(__mmask8 k, __m128i a, __m128i b);
(V)PMINSD __m128i _mm_min_epi32(__m128i a, __m128i b);
VPMINSD __m256i _mm256_min_epi32(__m256i a, __m256i b);


SIMD Floating-Point Exceptions 

None 


Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 
EVEX-encoded instruction, see Exceptions Type E4. 

PMINUB/PMINUW—Minimum of Packed Unsigned Integers

| Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description |
|---|---|---|---|---|
| 0F DA /r¹ PMINUB mm1, mm2/m64 | RM | V/V | SSE | Compare unsigned byte integers in mm2/m64 and mm1 and return minimum values. |
| 66 0F DA /r PMINUB xmm1, xmm2/m128 | RM | V/V | SSE2 | Compare packed unsigned byte integers in xmm1 and xmm2/m128 and store packed minimum values in xmm1. |
| 66 0F 38 3A /r PMINUW xmm1, xmm2/m128 | RM | V/V | SSE4_1 | Compare packed unsigned word integers in xmm2/m128 and xmm1 and store packed minimum values in xmm1. |
| VEX.NDS.128.66.0F DA /r VPMINUB xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Compare packed unsigned byte integers in xmm2 and xmm3/m128 and store packed minimum values in xmm1. |
| VEX.NDS.128.66.0F38 3A /r VPMINUW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Compare packed unsigned word integers in xmm3/m128 and xmm2 and return packed minimum values in xmm1. |
| VEX.NDS.256.66.0F DA /r VPMINUB ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Compare packed unsigned byte integers in ymm2 and ymm3/m256 and store packed minimum values in ymm1. |
| VEX.NDS.256.66.0F38 3A /r VPMINUW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Compare packed unsigned word integers in ymm3/m256 and ymm2 and return packed minimum values in ymm1. |
| EVEX.NDS.128.66.0F DA /r VPMINUB xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Compare packed unsigned byte integers in xmm2 and xmm3/m128 and store packed minimum values in xmm1 under writemask k1. |
| EVEX.NDS.256.66.0F DA /r VPMINUB ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Compare packed unsigned byte integers in ymm2 and ymm3/m256 and store packed minimum values in ymm1 under writemask k1. |
| EVEX.NDS.512.66.0F DA /r VPMINUB zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Compare packed unsigned byte integers in zmm2 and zmm3/m512 and store packed minimum values in zmm1 under writemask k1. |
| EVEX.NDS.128.66.0F38 3A /r VPMINUW xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Compare packed unsigned word integers in xmm3/m128 and xmm2 and return packed minimum values in xmm1 under writemask k1. |
| EVEX.NDS.256.66.0F38 3A /r VPMINUW ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Compare packed unsigned word integers in ymm3/m256 and ymm2 and return packed minimum values in ymm1 under writemask k1. |
| EVEX.NDS.512.66.0F38 3A /r VPMINUW zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Compare packed unsigned word integers in zmm3/m512 and zmm2 and return packed minimum values in zmm1 under writemask k1. |

NOTES:

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers" in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding 


| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA |
| RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA |
| FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA |

Description

Performs a SIMD compare of the packed unsigned byte or word integers in the second source operand and the first
source operand and returns the minimum value for each pair of integers to the destination operand.

Legacy SSE version PMINUB: The source operand can be an MMX technology register or a 64-bit memory location.
The destination operand can be an MMX technology register.

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding destination
register remain unchanged.

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding destination
register are zeroed.

VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register; the second source operand is a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is conditionally updated
based on writemask k1.

Operation 

PMINUB (for 64-bit operands)

IF DEST[7:0] < SRC[7:0] THEN
    DEST[7:0] ← DEST[7:0];
ELSE
    DEST[7:0] ← SRC[7:0]; FI;
(* Repeat operation for 2nd through 7th bytes in source and destination operands *)
IF DEST[63:56] < SRC[63:56] THEN
    DEST[63:56] ← DEST[63:56];
ELSE
    DEST[63:56] ← SRC[63:56]; FI;

PMINUB instruction for 128-bit operands: 

IF DEST[7:0] < SRC[7:0] THEN 
DEST[7:0] ^ DEST[7:0]; 

ELSE 

DEST[15:0] ^ SRC[7:0]; FI; 

(* Repeat operation for 2nd through 15th bytes in source and destination operands *) 

IF DEST[127:120] < SRC[127:120] THEN 
DEST[127:120] ^ DEST[127:120]; 

ELSE 

DEST[127:120] ^ SRC[127:120]; FI; 

DEST[MAX_VL-1:128] (Unmodified) 

VPMINUB (VEX.128 encoded version) 

IFSRC1[7:0] < SRC2[7:0] THEN 
DEST[7:0] ^ SRC1 [7:0]; 

ELSE 

DEST[7:0] ^ SRC2[7:0]; FI; 

(* Repeat operation for 2nd through 15th bytes in source and destination operands *) 

IF SRC1 [127:120] < SRC2[127:120] THEN 
DEST[127:120] ^ SRC1 [127:120]; 

ELSE 

DEST[127:120] ^ SRC2[127:120]; FI; 

DEST[MAX_VL-1:128]^0 
VPMINUB (VEX.256 encoded version)

IF SRC1[7:0] < SRC2[7:0] THEN
    DEST[7:0] ← SRC1[7:0];
ELSE
    DEST[7:0] ← SRC2[7:0]; FI;
(* Repeat operation for 2nd through 31st bytes in source and destination operands *)
IF SRC1[255:248] < SRC2[255:248] THEN
    DEST[255:248] ← SRC1[255:248];
ELSE
    DEST[255:248] ← SRC2[255:248]; FI;
DEST[MAX_VL-1:256] ← 0

VPMINUB (EVEX encoded versions)

(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask* THEN
        IF SRC1[i+7:i] < SRC2[i+7:i]
            THEN DEST[i+7:i] ← SRC1[i+7:i];
            ELSE DEST[i+7:i] ← SRC2[i+7:i];
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+7:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+7:i] ← 0
        FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0
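
The writemask handling in the EVEX loop above can be sketched in scalar C as follows. This is a minimal illustration, assuming one array element per byte lane; the function name masked_min_epu8_model and the zeroing flag (standing in for the {z} modifier) are hypothetical and not part of any Intel interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical scalar model of the EVEX VPMINUB loop.
 * kl is the element count (16, 32 or 64); bit j of k1 enables element j. */
void masked_min_epu8_model(uint8_t *dest, const uint8_t *src1,
                           const uint8_t *src2, uint64_t k1,
                           bool zeroing, int kl)
{
    for (int j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u) {
            /* Mask bit set: write the unsigned minimum. */
            dest[j] = (src1[j] < src2[j]) ? src1[j] : src2[j];
        } else if (zeroing) {
            /* Zeroing-masking: a masked-off element becomes 0. */
            dest[j] = 0;
        }
        /* Merging-masking: a masked-off element keeps its old value. */
    }
}
```

The only difference between the two masking modes is what happens to elements whose mask bit is clear, which is why the model needs the previous destination contents as an input.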

PMINUW instruction for 128-bit operands:

IF DEST[15:0] < SRC[15:0] THEN
    DEST[15:0] ← DEST[15:0];
ELSE
    DEST[15:0] ← SRC[15:0]; FI;
(* Repeat operation for 2nd through 7th words in source and destination operands *)
IF DEST[127:112] < SRC[127:112] THEN
    DEST[127:112] ← DEST[127:112];
ELSE
    DEST[127:112] ← SRC[127:112]; FI;
DEST[MAX_VL-1:128] (Unmodified)

VPMINUW (VEX.128 encoded version)

IF SRC1[15:0] < SRC2[15:0] THEN
    DEST[15:0] ← SRC1[15:0];
ELSE
    DEST[15:0] ← SRC2[15:0]; FI;
(* Repeat operation for 2nd through 7th words in source and destination operands *)
IF SRC1[127:112] < SRC2[127:112] THEN
    DEST[127:112] ← SRC1[127:112];
ELSE
    DEST[127:112] ← SRC2[127:112]; FI;
DEST[MAX_VL-1:128] ← 0

VPMINUW (VEX.256 encoded version)

IF SRC1[15:0] < SRC2[15:0] THEN
    DEST[15:0] ← SRC1[15:0];
ELSE
    DEST[15:0] ← SRC2[15:0]; FI;
(* Repeat operation for 2nd through 15th words in source and destination operands *)
IF SRC1[255:240] < SRC2[255:240] THEN
    DEST[255:240] ← SRC1[255:240];
ELSE
    DEST[255:240] ← SRC2[255:240]; FI;
DEST[MAX_VL-1:256] ← 0

VPMINUW (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask* THEN
        IF SRC1[i+15:i] < SRC2[i+15:i]
            THEN DEST[i+15:i] ← SRC1[i+15:i];
            ELSE DEST[i+15:i] ← SRC2[i+15:i];
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+15:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+15:i] ← 0
        FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent 

VPMINUB __m512i _mm512_min_epu8(__m512i a, __m512i b);
VPMINUB __m512i _mm512_mask_min_epu8(__m512i s, __mmask64 k, __m512i a, __m512i b);
VPMINUB __m512i _mm512_maskz_min_epu8(__mmask64 k, __m512i a, __m512i b);
VPMINUW __m512i _mm512_min_epu16(__m512i a, __m512i b);
VPMINUW __m512i _mm512_mask_min_epu16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPMINUW __m512i _mm512_maskz_min_epu16(__mmask32 k, __m512i a, __m512i b);
VPMINUB __m256i _mm256_mask_min_epu8(__m256i s, __mmask32 k, __m256i a, __m256i b);
VPMINUB __m256i _mm256_maskz_min_epu8(__mmask32 k, __m256i a, __m256i b);
VPMINUW __m256i _mm256_mask_min_epu16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPMINUW __m256i _mm256_maskz_min_epu16(__mmask16 k, __m256i a, __m256i b);
VPMINUB __m128i _mm_mask_min_epu8(__m128i s, __mmask16 k, __m128i a, __m128i b);
VPMINUB __m128i _mm_maskz_min_epu8(__mmask16 k, __m128i a, __m128i b);
VPMINUW __m128i _mm_mask_min_epu16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMINUW __m128i _mm_maskz_min_epu16(__mmask8 k, __m128i a, __m128i b);
(V)PMINUB __m128i _mm_min_epu8(__m128i a, __m128i b);
(V)PMINUW __m128i _mm_min_epu16(__m128i a, __m128i b);
VPMINUB __m256i _mm256_min_epu8(__m256i a, __m256i b);
VPMINUW __m256i _mm256_min_epu16(__m256i a, __m256i b);
PMINUB: __m64 _m_min_pu8(__m64 a, __m64 b);




SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 
EVEX-encoded instruction, see Exceptions Type E4.nb. 
PMINUD/PMINUQ—Minimum of Packed Unsigned Integers

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 38 3B /r PMINUD xmm1, xmm2/m128 | RM | V/V | SSE4_1 | Compare packed unsigned dword integers in xmm1 and xmm2/m128 and store packed minimum values in xmm1.
VEX.NDS.128.66.0F38.WIG 3B /r VPMINUD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Compare packed unsigned dword integers in xmm2 and xmm3/m128 and store packed minimum values in xmm1.
VEX.NDS.256.66.0F38.WIG 3B /r VPMINUD ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Compare packed unsigned dword integers in ymm2 and ymm3/m256 and store packed minimum values in ymm1.
EVEX.NDS.128.66.0F38.W0 3B /r VPMINUD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Compare packed unsigned dword integers in xmm2 and xmm3/m128/m32bcst and store packed minimum values in xmm1 under writemask k1.
EVEX.NDS.256.66.0F38.W0 3B /r VPMINUD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Compare packed unsigned dword integers in ymm2 and ymm3/m256/m32bcst and store packed minimum values in ymm1 under writemask k1.
EVEX.NDS.512.66.0F38.W0 3B /r VPMINUD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | FV | V/V | AVX512F | Compare packed unsigned dword integers in zmm2 and zmm3/m512/m32bcst and store packed minimum values in zmm1 under writemask k1.
EVEX.NDS.128.66.0F38.W1 3B /r VPMINUQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Compare packed unsigned qword integers in xmm2 and xmm3/m128/m64bcst and store packed minimum values in xmm1 under writemask k1.
EVEX.NDS.256.66.0F38.W1 3B /r VPMINUQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Compare packed unsigned qword integers in ymm2 and ymm3/m256/m64bcst and store packed minimum values in ymm1 under writemask k1.
EVEX.NDS.512.66.0F38.W1 3B /r VPMINUQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | FV | V/V | AVX512F | Compare packed unsigned qword integers in zmm2 and zmm3/m512/m64bcst and store packed minimum values in zmm1 under writemask k1.


Instruction Operand Encoding 


Op/En   Operand 1          Operand 2        Operand 3        Operand 4
RM      ModRM:reg (r, w)   ModRM:r/m (r)    NA               NA
RVM     ModRM:reg (w)      VEX.vvvv (r)     ModRM:r/m (r)    NA
FV      ModRM:reg (w)      EVEX.vvvv (r)    ModRM:r/m (r)    NA


Description 

Performs a SIMD compare of the packed unsigned dword/qword integers in the second source operand and the first
source operand and returns the minimum value for each pair of integers to the destination operand.

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding destination
register remain unchanged.

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding destination
register are zeroed.

VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers. Bits (MAX_VL-1:256) of the corresponding destination
register are zeroed.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register; the second source operand is a
ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
32/64-bit memory location. The destination operand is conditionally updated based on writemask k1.
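
The broadcast form of the second source operand can be illustrated with a scalar C sketch. It is a simplified model under stated assumptions: the function name min_epu32_bcst_model and the bcst flag (standing in for EVEX.b = 1 with a memory operand) are hypothetical, writemasking is left out, and each array element represents one dword lane.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical scalar model of VPMINUD with optional embedded broadcast.
 * kl is the element count (4, 8 or 16). When bcst is true only src2[0] is
 * read and that single dword is compared against every element of src1. */
void min_epu32_bcst_model(uint32_t *dest, const uint32_t *src1,
                          const uint32_t *src2, bool bcst, int kl)
{
    for (int j = 0; j < kl; j++) {
        uint32_t rhs = bcst ? src2[0] : src2[j];
        dest[j] = (src1[j] < rhs) ? src1[j] : rhs;
    }
}
```

A qword model for VPMINUQ would look the same with uint64_t lanes and element counts of 2, 4 or 8.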
Operation

PMINUD (128-bit Legacy SSE version)

PMINUD instruction for 128-bit operands:

IF DEST[31:0] < SRC[31:0] THEN
    DEST[31:0] ← DEST[31:0];
ELSE
    DEST[31:0] ← SRC[31:0]; FI;
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)
IF DEST[127:96] < SRC[127:96] THEN
    DEST[127:96] ← DEST[127:96];
ELSE
    DEST[127:96] ← SRC[127:96]; FI;
DEST[MAX_VL-1:128] (Unmodified)

VPMINUD (VEX.128 encoded version)

VPMINUD instruction for 128-bit operands:

IF SRC1[31:0] < SRC2[31:0] THEN
    DEST[31:0] ← SRC1[31:0];
ELSE
    DEST[31:0] ← SRC2[31:0]; FI;
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)
IF SRC1[127:96] < SRC2[127:96] THEN
    DEST[127:96] ← SRC1[127:96];
ELSE
    DEST[127:96] ← SRC2[127:96]; FI;
DEST[MAX_VL-1:128] ← 0

VPMINUD (VEX.256 encoded version)

VPMINUD instruction for 256-bit operands:

IF SRC1[31:0] < SRC2[31:0] THEN
    DEST[31:0] ← SRC1[31:0];
ELSE
    DEST[31:0] ← SRC2[31:0]; FI;
(* Repeat operation for 2nd through 7th dwords in source and destination operands *)
IF SRC1[255:224] < SRC2[255:224] THEN
    DEST[255:224] ← SRC1[255:224];
ELSE
    DEST[255:224] ← SRC2[255:224]; FI;
DEST[MAX_VL-1:256] ← 0
VPMINUD (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*)
            THEN
                IF SRC1[i+31:i] < SRC2[31:0]
                    THEN DEST[i+31:i] ← SRC1[i+31:i];
                    ELSE DEST[i+31:i] ← SRC2[31:0];
                FI;
            ELSE
                IF SRC1[i+31:i] < SRC2[i+31:i]
                    THEN DEST[i+31:i] ← SRC1[i+31:i];
                    ELSE DEST[i+31:i] ← SRC2[i+31:i];
                FI;
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+31:i] ← 0
        FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0

VPMINUQ (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*)
            THEN
                IF SRC1[i+63:i] < SRC2[63:0]
                    THEN DEST[i+63:i] ← SRC1[i+63:i];
                    ELSE DEST[i+63:i] ← SRC2[63:0];
                FI;
            ELSE
                IF SRC1[i+63:i] < SRC2[i+63:i]
                    THEN DEST[i+63:i] ← SRC1[i+63:i];
                    ELSE DEST[i+63:i] ← SRC2[i+63:i];
                FI;
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0
Intel C/C++ Compiler Intrinsic Equivalent

VPMINUD __m512i _mm512_min_epu32(__m512i a, __m512i b);
VPMINUD __m512i _mm512_mask_min_epu32(__m512i s, __mmask16 k, __m512i a, __m512i b);
VPMINUD __m512i _mm512_maskz_min_epu32(__mmask16 k, __m512i a, __m512i b);
VPMINUQ __m512i _mm512_min_epu64(__m512i a, __m512i b);
VPMINUQ __m512i _mm512_mask_min_epu64(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPMINUQ __m512i _mm512_maskz_min_epu64(__mmask8 k, __m512i a, __m512i b);
VPMINUD __m256i _mm256_mask_min_epu32(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPMINUD __m256i _mm256_maskz_min_epu32(__mmask16 k, __m256i a, __m256i b);
VPMINUQ __m256i _mm256_mask_min_epu64(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPMINUQ __m256i _mm256_maskz_min_epu64(__mmask8 k, __m256i a, __m256i b);
VPMINUD __m128i _mm_mask_min_epu32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMINUD __m128i _mm_maskz_min_epu32(__mmask8 k, __m128i a, __m128i b);
VPMINUQ __m128i _mm_mask_min_epu64(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMINUQ __m128i _mm_maskz_min_epu64(__mmask8 k, __m128i a, __m128i b);
(V)PMINUD __m128i _mm_min_epu32(__m128i a, __m128i b);
VPMINUD __m256i _mm256_min_epu32(__m256i a, __m256i b);


SIMD Floating-Point Exceptions 

None 


Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 
EVEX-encoded instruction, see Exceptions Type E4. 
PMOVMSKB-Move Byte Mask

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F D7 /r (see Note 1) PMOVMSKB reg, mm | RM | V/V | SSE | Move a byte mask of mm to reg. The upper bits of r32 or r64 are zeroed.
66 0F D7 /r PMOVMSKB reg, xmm | RM | V/V | SSE2 | Move a byte mask of xmm to reg. The upper bits of r32 or r64 are zeroed.
VEX.128.66.0F.WIG D7 /r VPMOVMSKB reg, xmm1 | RM | V/V | AVX | Move a byte mask of xmm1 to reg. The upper bits of r32 or r64 are filled with zeros.
VEX.256.66.0F.WIG D7 /r VPMOVMSKB reg, ymm1 | RM | V/V | AVX2 | Move a 32-bit mask of ymm1 to reg. The upper bits of r64 are filled with zeros.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers"
in the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding 


Op/En   Operand 1          Operand 2        Operand 3        Operand 4
RM      ModRM:reg (w)      ModRM:r/m (r)    NA               NA


Description 

Creates a mask made up of the most significant bit of each byte of the source operand (second operand) and stores
the result in the low byte, word, or doubleword of the destination operand (first operand).

The byte mask is 8 bits for 64-bit source operand, 16 bits for 128-bit source operand and 32 bits for 256-bit source
operand. The destination operand is a general-purpose register.

In 64-bit mode, the instruction can access additional registers (XMM8-XMM15, R8-R15) when used with a REX.R
prefix. The default operand size is 64-bit in 64-bit mode.

Legacy SSE version: The source operand is an MMX technology register.

128-bit Legacy SSE version: The source operand is an XMM register.

VEX.128 encoded version: The source operand is an XMM register.

VEX.256 encoded version: The source operand is a YMM register.

Note: VEX.vvvv is reserved and must be 1111b.
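
The mask construction described above can be sketched as a scalar C model. This is an illustration only, with hypothetical names: movemask_epi8_model is not an Intel intrinsic, the source operand is passed as a byte array with element 0 as the least significant byte, and nbytes is expected to be 8, 16 or 32.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of (V)PMOVMSKB: bit i of the result is the most
 * significant bit of source byte i; all higher result bits are zero. */
uint32_t movemask_epi8_model(const uint8_t *src, size_t nbytes)
{
    uint32_t mask = 0;
    for (size_t i = 0; i < nbytes && i < 32; i++) {
        mask |= (uint32_t)(src[i] >> 7) << i;
    }
    return mask;
}
```

A common use of such a mask is to follow a packed byte compare, whose result bytes are all ones or all zeros, and then test or bit-scan the mask in a general-purpose register.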

Operation 

PMOVMSKB (with 64-bit source operand and r32)

r32[0] ← SRC[7];
r32[1] ← SRC[15];
(* Repeat operation for bytes 2 through 6 *)
r32[7] ← SRC[63];
r32[31:8] ← ZERO_FILL;

(V)PMOVMSKB (with 128-bit source operand and r32)

r32[0] ← SRC[7];
r32[1] ← SRC[15];
(* Repeat operation for bytes 2 through 14 *)
r32[15] ← SRC[127];
r32[31:16] ← ZERO_FILL;




VPMOVMSKB (with 256-bit source operand and r32)

r32[0] ← SRC[7];
r32[1] ← SRC[15];
(* Repeat operation for bytes 2 through 30 *)
r32[31] ← SRC[255];

PMOVMSKB (with 64-bit source operand and r64)

r64[0] ← SRC[7];
r64[1] ← SRC[15];
(* Repeat operation for bytes 2 through 6 *)
r64[7] ← SRC[63];
r64[63:8] ← ZERO_FILL;

(V)PMOVMSKB (with 128-bit source operand and r64)

r64[0] ← SRC[7];
r64[1] ← SRC[15];
(* Repeat operation for bytes 2 through 14 *)
r64[15] ← SRC[127];
r64[63:16] ← ZERO_FILL;

VPMOVMSKB (with 256-bit source operand and r64)

r64[0] ← SRC[7];
r64[1] ← SRC[15];
(* Repeat operation for bytes 2 through 30 *)
r64[31] ← SRC[255];
r64[63:32] ← ZERO_FILL;

Intel C/C++ Compiler Intrinsic Equivalent 

PMOVMSKB: int _mm_movemask_pi8(__m64 a)
(V)PMOVMSKB: int _mm_movemask_epi8(__m128i a)
VPMOVMSKB: int _mm256_movemask_epi8(__m256i a)

Flags Affected 

None. 

Numeric Exceptions 

None. 

Other Exceptions 

See Exceptions Type 7; additionally
#UD If VEX.vvvv ≠ 1111B.
PMOVSX—Packed Move with Sign Extend

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 38 20 /r PMOVSXBW xmm1, xmm2/m64 | RM | V/V | SSE4_1 | Sign extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 16-bit integers in xmm1.
66 0F 38 21 /r PMOVSXBD xmm1, xmm2/m32 | RM | V/V | SSE4_1 | Sign extend 4 packed 8-bit integers in the low 4 bytes of xmm2/m32 to 4 packed 32-bit integers in xmm1.
66 0F 38 22 /r PMOVSXBQ xmm1, xmm2/m16 | RM | V/V | SSE4_1 | Sign extend 2 packed 8-bit integers in the low 2 bytes of xmm2/m16 to 2 packed 64-bit integers in xmm1.
66 0F 38 23 /r PMOVSXWD xmm1, xmm2/m64 | RM | V/V | SSE4_1 | Sign extend 4 packed 16-bit integers in the low 8 bytes of xmm2/m64 to 4 packed 32-bit integers in xmm1.
66 0F 38 24 /r PMOVSXWQ xmm1, xmm2/m32 | RM | V/V | SSE4_1 | Sign extend 2 packed 16-bit integers in the low 4 bytes of xmm2/m32 to 2 packed 64-bit integers in xmm1.
66 0F 38 25 /r PMOVSXDQ xmm1, xmm2/m64 | RM | V/V | SSE4_1 | Sign extend 2 packed 32-bit integers in the low 8 bytes of xmm2/m64 to 2 packed 64-bit integers in xmm1.
VEX.128.66.0F38.WIG 20 /r VPMOVSXBW xmm1, xmm2/m64 | RM | V/V | AVX | Sign extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 16-bit integers in xmm1.
VEX.128.66.0F38.WIG 21 /r VPMOVSXBD xmm1, xmm2/m32 | RM | V/V | AVX | Sign extend 4 packed 8-bit integers in the low 4 bytes of xmm2/m32 to 4 packed 32-bit integers in xmm1.
VEX.128.66.0F38.WIG 22 /r VPMOVSXBQ xmm1, xmm2/m16 | RM | V/V | AVX | Sign extend 2 packed 8-bit integers in the low 2 bytes of xmm2/m16 to 2 packed 64-bit integers in xmm1.
VEX.128.66.0F38.WIG 23 /r VPMOVSXWD xmm1, xmm2/m64 | RM | V/V | AVX | Sign extend 4 packed 16-bit integers in the low 8 bytes of xmm2/m64 to 4 packed 32-bit integers in xmm1.
VEX.128.66.0F38.WIG 24 /r VPMOVSXWQ xmm1, xmm2/m32 | RM | V/V | AVX | Sign extend 2 packed 16-bit integers in the low 4 bytes of xmm2/m32 to 2 packed 64-bit integers in xmm1.
VEX.128.66.0F38.WIG 25 /r VPMOVSXDQ xmm1, xmm2/m64 | RM | V/V | AVX | Sign extend 2 packed 32-bit integers in the low 8 bytes of xmm2/m64 to 2 packed 64-bit integers in xmm1.
VEX.256.66.0F38.WIG 20 /r VPMOVSXBW ymm1, xmm2/m128 | RM | V/V | AVX2 | Sign extend 16 packed 8-bit integers in xmm2/m128 to 16 packed 16-bit integers in ymm1.
VEX.256.66.0F38.WIG 21 /r VPMOVSXBD ymm1, xmm2/m64 | RM | V/V | AVX2 | Sign extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 32-bit integers in ymm1.
VEX.256.66.0F38.WIG 22 /r VPMOVSXBQ ymm1, xmm2/m32 | RM | V/V | AVX2 | Sign extend 4 packed 8-bit integers in the low 4 bytes of xmm2/m32 to 4 packed 64-bit integers in ymm1.
VEX.256.66.0F38.WIG 23 /r VPMOVSXWD ymm1, xmm2/m128 | RM | V/V | AVX2 | Sign extend 8 packed 16-bit integers in the low 16 bytes of xmm2/m128 to 8 packed 32-bit integers in ymm1.
VEX.256.66.0F38.WIG 24 /r VPMOVSXWQ ymm1, xmm2/m64 | RM | V/V | AVX2 | Sign extend 4 packed 16-bit integers in the low 8 bytes of xmm2/m64 to 4 packed 64-bit integers in ymm1.
VEX.256.66.0F38.WIG 25 /r VPMOVSXDQ ymm1, xmm2/m128 | RM | V/V | AVX2 | Sign extend 4 packed 32-bit integers in the low 16 bytes of xmm2/m128 to 4 packed 64-bit integers in ymm1.
EVEX.128.66.0F38.WIG 20 /r VPMOVSXBW xmm1 {k1}{z}, xmm2/m64 | HVM | V/V | AVX512VL AVX512BW | Sign extend 8 packed 8-bit integers in xmm2/m64 to 8 packed 16-bit integers in xmm1.
EVEX.256.66.0F38.WIG 20 /r VPMOVSXBW ymm1 {k1}{z}, xmm2/m128 | HVM | V/V | AVX512VL AVX512BW | Sign extend 16 packed 8-bit integers in xmm2/m128 to 16 packed 16-bit integers in ymm1.
EVEX.512.66.0F38.WIG 20 /r VPMOVSXBW zmm1 {k1}{z}, ymm2/m256 | HVM | V/V | AVX512BW | Sign extend 32 packed 8-bit integers in ymm2/m256 to 32 packed 16-bit integers in zmm1.
EVEX.128.66.0F38.WIG 21 /r VPMOVSXBD xmm1 {k1}{z}, xmm2/m32 | QVM | V/V | AVX512VL AVX512F | Sign extend 4 packed 8-bit integers in the low 4 bytes of xmm2/m32 to 4 packed 32-bit integers in xmm1 subject to writemask k1.
EVEX.256.66.0F38.WIG 21 /r VPMOVSXBD ymm1 {k1}{z}, xmm2/m64 | QVM | V/V | AVX512VL AVX512F | Sign extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 32-bit integers in ymm1 subject to writemask k1.
EVEX.512.66.0F38.WIG 21 /r VPMOVSXBD zmm1 {k1}{z}, xmm2/m128 | QVM | V/V | AVX512F | Sign extend 16 packed 8-bit integers in the low 16 bytes of xmm2/m128 to 16 packed 32-bit integers in zmm1 subject to writemask k1.
EVEX.128.66.0F38.WIG 22 /r VPMOVSXBQ xmm1 {k1}{z}, xmm2/m16 | OVM | V/V | AVX512VL AVX512F | Sign extend 2 packed 8-bit integers in the low 2 bytes of xmm2/m16 to 2 packed 64-bit integers in xmm1 subject to writemask k1.
EVEX.256.66.0F38.WIG 22 /r VPMOVSXBQ ymm1 {k1}{z}, xmm2/m32 | OVM | V/V | AVX512VL AVX512F | Sign extend 4 packed 8-bit integers in the low 4 bytes of xmm2/m32 to 4 packed 64-bit integers in ymm1 subject to writemask k1.
EVEX.512.66.0F38.WIG 22 /r VPMOVSXBQ zmm1 {k1}{z}, xmm2/m64 | OVM | V/V | AVX512F | Sign extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 64-bit integers in zmm1 subject to writemask k1.
EVEX.128.66.0F38.WIG 23 /r VPMOVSXWD xmm1 {k1}{z}, xmm2/m64 | HVM | V/V | AVX512VL AVX512F | Sign extend 4 packed 16-bit integers in the low 8 bytes of xmm2/m64 to 4 packed 32-bit integers in xmm1 subject to writemask k1.
EVEX.256.66.0F38.WIG 23 /r VPMOVSXWD ymm1 {k1}{z}, xmm2/m128 | HVM | V/V | AVX512VL AVX512F | Sign extend 8 packed 16-bit integers in the low 16 bytes of xmm2/m128 to 8 packed 32-bit integers in ymm1 subject to writemask k1.
EVEX.512.66.0F38.WIG 23 /r VPMOVSXWD zmm1 {k1}{z}, ymm2/m256 | HVM | V/V | AVX512F | Sign extend 16 packed 16-bit integers in the low 32 bytes of ymm2/m256 to 16 packed 32-bit integers in zmm1 subject to writemask k1.
EVEX.128.66.0F38.WIG 24 /r VPMOVSXWQ xmm1 {k1}{z}, xmm2/m32 | QVM | V/V | AVX512VL AVX512F | Sign extend 2 packed 16-bit integers in the low 4 bytes of xmm2/m32 to 2 packed 64-bit integers in xmm1 subject to writemask k1.
EVEX.256.66.0F38.WIG 24 /r VPMOVSXWQ ymm1 {k1}{z}, xmm2/m64 | QVM | V/V | AVX512VL AVX512F | Sign extend 4 packed 16-bit integers in the low 8 bytes of xmm2/m64 to 4 packed 64-bit integers in ymm1 subject to writemask k1.
EVEX.512.66.0F38.WIG 24 /r VPMOVSXWQ zmm1 {k1}{z}, xmm2/m128 | QVM | V/V | AVX512F | Sign extend 8 packed 16-bit integers in the low 16 bytes of xmm2/m128 to 8 packed 64-bit integers in zmm1 subject to writemask k1.
EVEX.128.66.0F38.W0 25 /r VPMOVSXDQ xmm1 {k1}{z}, xmm2/m64 | HVM | V/V | AVX512VL AVX512F | Sign extend 2 packed 32-bit integers in the low 8 bytes of xmm2/m64 to 2 packed 64-bit integers in xmm1 using writemask k1.
EVEX.256.66.0F38.W0 25 /r VPMOVSXDQ ymm1 {k1}{z}, xmm2/m128 | HVM | V/V | AVX512VL AVX512F | Sign extend 4 packed 32-bit integers in the low 16 bytes of xmm2/m128 to 4 packed 64-bit integers in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 25 /r VPMOVSXDQ zmm1 {k1}{z}, ymm2/m256 | HVM | V/V | AVX512F | Sign extend 8 packed 32-bit integers in the low 32 bytes of ymm2/m256 to 8 packed 64-bit integers in zmm1 using writemask k1.

Instruction Operand Encoding

Op/En           Operand 1        Operand 2        Operand 3   Operand 4
RM              ModRM:reg (w)    ModRM:r/m (r)    NA          NA
HVM, QVM, OVM   ModRM:reg (w)    ModRM:r/m (r)    NA          NA


Description 

Legacy and VEX encoded versions: Packed byte, word, or dword integers in the low bytes of the source operand
(second operand) are sign extended to word, dword, or quadword integers and stored as packed signed integers in
the destination operand.

128-bit Legacy SSE version: Bits (MAX_VL-1:128) of the corresponding destination register remain unchanged.

VEX.128 and EVEX.128 encoded versions: Bits (MAX_VL-1:128) of the corresponding destination register are
zeroed.

VEX.256 and EVEX.256 encoded versions: Bits (MAX_VL-1:256) of the corresponding destination register are
zeroed.

EVEX encoded versions: Packed byte, word or dword integers starting from the low bytes of the source operand
(second operand) are sign extended to word, dword or quadword integers and stored to the destination operand
under the writemask. The destination register is an XMM, YMM or ZMM register.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise the instruction will #UD.
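
The byte-to-word case of the sign extension described above can be sketched in scalar C. This is an illustrative model with a hypothetical name (sign_extend_bytes_to_words_model), assuming one array element per lane; it relies on the fact that converting int8_t to int16_t in C preserves the value, which is exactly a sign extension.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of PMOVSXBW: widen n signed bytes taken from the
 * low end of the source into n signed words in the destination. */
void sign_extend_bytes_to_words_model(int16_t *dest, const int8_t *src,
                                      size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Value-preserving conversion: 0xFF (-1) becomes 0xFFFF (-1). */
        dest[i] = (int16_t)src[i];
    }
}
```

Note that only the low n bytes of the source are consumed, which is why the memory forms of these instructions read 64, 32 or 16 bits rather than a full vector.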
Operation

Packed_Sign_Extend_BYTE_to_WORD(DEST, SRC)
DEST[15:0] ← SignExtend(SRC[7:0]);
DEST[31:16] ← SignExtend(SRC[15:8]);
DEST[47:32] ← SignExtend(SRC[23:16]);
DEST[63:48] ← SignExtend(SRC[31:24]);
DEST[79:64] ← SignExtend(SRC[39:32]);
DEST[95:80] ← SignExtend(SRC[47:40]);
DEST[111:96] ← SignExtend(SRC[55:48]);
DEST[127:112] ← SignExtend(SRC[63:56]);

Packed_Sign_Extend_BYTE_to_DWORD(DEST, SRC)
DEST[31:0] ← SignExtend(SRC[7:0]);
DEST[63:32] ← SignExtend(SRC[15:8]);
DEST[95:64] ← SignExtend(SRC[23:16]);
DEST[127:96] ← SignExtend(SRC[31:24]);

Packed_Sign_Extend_BYTE_to_QWORD(DEST, SRC)
DEST[63:0] ← SignExtend(SRC[7:0]);
DEST[127:64] ← SignExtend(SRC[15:8]);

Packed_Sign_Extend_WORD_to_DWORD(DEST, SRC)
DEST[31:0] ← SignExtend(SRC[15:0]);
DEST[63:32] ← SignExtend(SRC[31:16]);
DEST[95:64] ← SignExtend(SRC[47:32]);
DEST[127:96] ← SignExtend(SRC[63:48]);

Packed_Sign_Extend_WORD_to_QWORD(DEST, SRC)
DEST[63:0] ← SignExtend(SRC[15:0]);
DEST[127:64] ← SignExtend(SRC[31:16]);

Packed_Sign_Extend_DWORD_to_QWORD(DEST, SRC)
DEST[63:0] ← SignExtend(SRC[31:0]);
DEST[127:64] ← SignExtend(SRC[63:32]);

VPMOVSXBW (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)
Packed_Sign_Extend_BYTE_to_WORD(TMP_DEST[127:0], SRC[63:0])
IF VL >= 256
    Packed_Sign_Extend_BYTE_to_WORD(TMP_DEST[255:128], SRC[127:64])
FI;
IF VL >= 512
    Packed_Sign_Extend_BYTE_to_WORD(TMP_DEST[383:256], SRC[191:128])
    Packed_Sign_Extend_BYTE_to_WORD(TMP_DEST[511:384], SRC[255:192])
FI;
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← TMP_DEST[i+15:i]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+15:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+15:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VPMOVSXBD (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)
Packed_Sign_Extend_BYTE_to_DWORD(TMP_DEST[127:0], SRC[31:0])
IF VL >= 256
    Packed_Sign_Extend_BYTE_to_DWORD(TMP_DEST[255:128], SRC[63:32])
FI;
IF VL >= 512
    Packed_Sign_Extend_BYTE_to_DWORD(TMP_DEST[383:256], SRC[95:64])
    Packed_Sign_Extend_BYTE_to_DWORD(TMP_DEST[511:384], SRC[127:96])
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+31:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VPMOVSXBQ (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)
Packed_Sign_Extend_BYTE_to_QWORD(TMP_DEST[127:0], SRC[15:0])
IF VL >= 256
    Packed_Sign_Extend_BYTE_to_QWORD(TMP_DEST[255:128], SRC[31:16])
FI;
IF VL >= 512
    Packed_Sign_Extend_BYTE_to_QWORD(TMP_DEST[383:256], SRC[47:32])
    Packed_Sign_Extend_BYTE_to_QWORD(TMP_DEST[511:384], SRC[63:48])
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
VPMOVSXWD (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)
Packed_Sign_Extend_WORD_to_DWORD(TMP_DEST[127:0], SRC[63:0])
IF VL >= 256
    Packed_Sign_Extend_WORD_to_DWORD(TMP_DEST[255:128], SRC[127:64])
FI;
IF VL >= 512
    Packed_Sign_Extend_WORD_to_DWORD(TMP_DEST[383:256], SRC[191:128])
    Packed_Sign_Extend_WORD_to_DWORD(TMP_DEST[511:384], SRC[255:192])
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+31:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VPMOVSXWQ (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)
Packed_Sign_Extend_WORD_to_QWORD(TMP_DEST[127:0], SRC[31:0])
IF VL >= 256
    Packed_Sign_Extend_WORD_to_QWORD(TMP_DEST[255:128], SRC[63:32])
FI;
IF VL >= 512
    Packed_Sign_Extend_WORD_to_QWORD(TMP_DEST[383:256], SRC[95:64])
    Packed_Sign_Extend_WORD_to_QWORD(TMP_DEST[511:384], SRC[127:96])
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
VPMOVSXDQ (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)
Packed_Sign_Extend_DWORD_to_QWORD(TMP_DEST[127:0], SRC[63:0])
IF VL >= 256
    Packed_Sign_Extend_DWORD_to_QWORD(TMP_DEST[255:128], SRC[127:64])
FI;
IF VL >= 512
    Packed_Sign_Extend_DWORD_to_QWORD(TMP_DEST[383:256], SRC[191:128])
    Packed_Sign_Extend_DWORD_to_QWORD(TMP_DEST[511:384], SRC[255:192])
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VPMOVSXBW (VEX.256 encoded version)

Packed_Sign_Extend_BYTE_to_WORD(DEST[127:0], SRC[63:0])
Packed_Sign_Extend_BYTE_to_WORD(DEST[255:128], SRC[127:64])
DEST[MAX_VL-1:256] ← 0

VPMOVSXBD (VEX.256 encoded version)

Packed_Sign_Extend_BYTE_to_DWORD(DEST[127:0], SRC[31:0])
Packed_Sign_Extend_BYTE_to_DWORD(DEST[255:128], SRC[63:32])
DEST[MAX_VL-1:256] ← 0

VPMOVSXBQ (VEX.256 encoded version)

Packed_Sign_Extend_BYTE_to_QWORD(DEST[127:0], SRC[15:0])
Packed_Sign_Extend_BYTE_to_QWORD(DEST[255:128], SRC[31:16])
DEST[MAX_VL-1:256] ← 0

VPMOVSXWD (VEX.256 encoded version)

Packed_Sign_Extend_WORD_to_DWORD(DEST[127:0], SRC[63:0])
Packed_Sign_Extend_WORD_to_DWORD(DEST[255:128], SRC[127:64])
DEST[MAX_VL-1:256] ← 0

VPMOVSXWQ (VEX.256 encoded version)

Packed_Sign_Extend_WORD_to_QWORD(DEST[127:0], SRC[31:0])
Packed_Sign_Extend_WORD_to_QWORD(DEST[255:128], SRC[63:32])
DEST[MAX_VL-1:256] ← 0

VPMOVSXDQ (VEX.256 encoded version)

Packed_Sign_Extend_DWORD_to_QWORD(DEST[127:0], SRC[63:0])
Packed_Sign_Extend_DWORD_to_QWORD(DEST[255:128], SRC[127:64])
DEST[MAX_VL-1:256] ← 0
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VPMOVSXBW (VEX.128 encoded version) 

Pacl<ed_Slgn_Extend_BYTE_to_WORDDEST[127:0], SRC[127:0]() 

DEST[MAX_VL-1:1281^0 

VPMOVSXBD (VEX.128 encoded version) 

Packed_Sign_Extend_BYTE_to_DW0RD(DEST[127:0], SRC[127:0]) 

DEST[MAX_VL-1:128]^0 

VPMOVSXBQ (VEX.128 encoded version) 

Packed_Sign_Extend_BYTE_to_QW0RD(DEST[127:0], SRC[127:0]) 

DEST[MAX_VL-1:128]^0 

VPMOVSXWD (VEX.128 encoded version) 

Packed_Sign_Extend_W0RD_to_DW0RD(DEST[127:0], SRC[127:0]) 

DEST[MAX_VL-1:128]^0 

VPMOVSXWQ (VEX.128 encoded version) 

Packed_Sign_Extend_WORD_to_QWORD(DEST[127:0], SRC[127:0]) 

DEST[MAX_VL-1:128]^0 

VPMOVSXDQ (VEX.128 encoded version)

Packed_Sign_Extend_DWORD_to_QWORD(DEST[127:0], SRC[127:0])
DEST[MAX_VL-1:128] ← 0

PMOVSXBW

Packed_Sign_Extend_BYTE_to_WORD(DEST[127:0], SRC[127:0])
DEST[MAX_VL-1:128] (Unmodified)

PMOVSXBD

Packed_Sign_Extend_BYTE_to_DWORD(DEST[127:0], SRC[127:0])
DEST[MAX_VL-1:128] (Unmodified)

PMOVSXBQ

Packed_Sign_Extend_BYTE_to_QWORD(DEST[127:0], SRC[127:0])
DEST[MAX_VL-1:128] (Unmodified)

PMOVSXWD

Packed_Sign_Extend_WORD_to_DWORD(DEST[127:0], SRC[127:0])
DEST[MAX_VL-1:128] (Unmodified)

PMOVSXWQ

Packed_Sign_Extend_WORD_to_QWORD(DEST[127:0], SRC[127:0])
DEST[MAX_VL-1:128] (Unmodified)

PMOVSXDQ

Packed_Sign_Extend_DWORD_to_QWORD(DEST[127:0], SRC[127:0])
DEST[MAX_VL-1:128] (Unmodified)
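
The Packed_Sign_Extend helpers used in the Operation section above can be modeled in scalar C. The following is only an illustrative sketch of the byte-to-word case; the function name and the array-based interface are assumptions made for this example and are not part of the instruction definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of Packed_Sign_Extend_BYTE_to_WORD:
   widen the low n bytes of src into n signed 16-bit words. */
void sign_extend_bytes_to_words(int16_t *dst, const int8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Converting int8_t to int16_t preserves the value, which
           replicates the sign bit into the upper 8 bits. */
        dst[i] = (int16_t)src[i];
    }
}
```

With n = 8 this corresponds to the 128-bit forms, which read only the low 8 bytes of the source; a byte such as 80H becomes the word FF80H.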

Intel C/C++ Compiler Intrinsic Equivalent 

VPMOVSXBW __m512i _mm512_cvtepi8_epi16(__m512i a);
VPMOVSXBW __m512i _mm512_mask_cvtepi8_epi16(__m512i a, __mmask32 k, __m512i b);
VPMOVSXBW __m512i _mm512_maskz_cvtepi8_epi16(__mmask32 k, __m512i b);
VPMOVSXBD __m512i _mm512_cvtepi8_epi32(__m512i a);
VPMOVSXBD __m512i _mm512_mask_cvtepi8_epi32(__m512i a, __mmask16 k, __m512i b);
VPMOVSXBD __m512i _mm512_maskz_cvtepi8_epi32(__mmask16 k, __m512i b);
VPMOVSXBQ __m512i _mm512_cvtepi8_epi64(__m512i a);
VPMOVSXBQ __m512i _mm512_mask_cvtepi8_epi64(__m512i a, __mmask8 k, __m512i b);
VPMOVSXBQ __m512i _mm512_maskz_cvtepi8_epi64(__mmask8 k, __m512i a);
VPMOVSXDQ __m512i _mm512_cvtepi32_epi64(__m512i a);
VPMOVSXDQ __m512i _mm512_mask_cvtepi32_epi64(__m512i a, __mmask8 k, __m512i b);
VPMOVSXDQ __m512i _mm512_maskz_cvtepi32_epi64(__mmask8 k, __m512i a);
VPMOVSXWD __m512i _mm512_cvtepi16_epi32(__m512i a);
VPMOVSXWD __m512i _mm512_mask_cvtepi16_epi32(__m512i a, __mmask16 k, __m512i b);
VPMOVSXWD __m512i _mm512_maskz_cvtepi16_epi32(__mmask16 k, __m512i a);
VPMOVSXWQ __m512i _mm512_cvtepi16_epi64(__m512i a);
VPMOVSXWQ __m512i _mm512_mask_cvtepi16_epi64(__m512i a, __mmask8 k, __m512i b);
VPMOVSXWQ __m512i _mm512_maskz_cvtepi16_epi64(__mmask8 k, __m512i a);
VPMOVSXBW __m256i _mm256_cvtepi8_epi16(__m256i a);
VPMOVSXBW __m256i _mm256_mask_cvtepi8_epi16(__m256i a, __mmask16 k, __m256i b);
VPMOVSXBW __m256i _mm256_maskz_cvtepi8_epi16(__mmask16 k, __m256i b);
VPMOVSXBD __m256i _mm256_cvtepi8_epi32(__m256i a);
VPMOVSXBD __m256i _mm256_mask_cvtepi8_epi32(__m256i a, __mmask8 k, __m256i b);
VPMOVSXBD __m256i _mm256_maskz_cvtepi8_epi32(__mmask8 k, __m256i b);
VPMOVSXBQ __m256i _mm256_cvtepi8_epi64(__m256i a);
VPMOVSXBQ __m256i _mm256_mask_cvtepi8_epi64(__m256i a, __mmask8 k, __m256i b);
VPMOVSXBQ __m256i _mm256_maskz_cvtepi8_epi64(__mmask8 k, __m256i a);
VPMOVSXDQ __m256i _mm256_cvtepi32_epi64(__m256i a);
VPMOVSXDQ __m256i _mm256_mask_cvtepi32_epi64(__m256i a, __mmask8 k, __m256i b);
VPMOVSXDQ __m256i _mm256_maskz_cvtepi32_epi64(__mmask8 k, __m256i a);
VPMOVSXWD __m256i _mm256_cvtepi16_epi32(__m256i a);
VPMOVSXWD __m256i _mm256_mask_cvtepi16_epi32(__m256i a, __mmask16 k, __m256i b);
VPMOVSXWD __m256i _mm256_maskz_cvtepi16_epi32(__mmask16 k, __m256i a);
VPMOVSXWQ __m256i _mm256_cvtepi16_epi64(__m256i a);
VPMOVSXWQ __m256i _mm256_mask_cvtepi16_epi64(__m256i a, __mmask8 k, __m256i b);
VPMOVSXWQ __m256i _mm256_maskz_cvtepi16_epi64(__mmask8 k, __m256i a);
VPMOVSXBW __m128i _mm_mask_cvtepi8_epi16(__m128i a, __mmask8 k, __m128i b);
VPMOVSXBW __m128i _mm_maskz_cvtepi8_epi16(__mmask8 k, __m128i b);
VPMOVSXBD __m128i _mm_mask_cvtepi8_epi32(__m128i a, __mmask8 k, __m128i b);
VPMOVSXBD __m128i _mm_maskz_cvtepi8_epi32(__mmask8 k, __m128i b);
VPMOVSXBQ __m128i _mm_mask_cvtepi8_epi64(__m128i a, __mmask8 k, __m128i b);
VPMOVSXBQ __m128i _mm_maskz_cvtepi8_epi64(__mmask8 k, __m128i a);
VPMOVSXDQ __m128i _mm_mask_cvtepi32_epi64(__m128i a, __mmask8 k, __m128i b);
VPMOVSXDQ __m128i _mm_maskz_cvtepi32_epi64(__mmask8 k, __m128i a);
VPMOVSXWD __m128i _mm_mask_cvtepi16_epi32(__m128i a, __mmask16 k, __m128i b);
VPMOVSXWD __m128i _mm_maskz_cvtepi16_epi32(__mmask16 k, __m128i a);
VPMOVSXWQ __m128i _mm_mask_cvtepi16_epi64(__m128i a, __mmask8 k, __m128i b);
VPMOVSXWQ __m128i _mm_maskz_cvtepi16_epi64(__mmask8 k, __m128i a);
PMOVSXBW __m128i _mm_cvtepi8_epi16 (__m128i a);
PMOVSXBD __m128i _mm_cvtepi8_epi32 (__m128i a);
PMOVSXBQ __m128i _mm_cvtepi8_epi64 (__m128i a);
PMOVSXWD __m128i _mm_cvtepi16_epi32 (__m128i a);
PMOVSXWQ __m128i _mm_cvtepi16_epi64 (__m128i a);
PMOVSXDQ __m128i _mm_cvtepi32_epi64 (__m128i a);

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 5.
EVEX-encoded instruction, see Exceptions Type E5.

#UD If VEX.vvvv != 1111B, or EVEX.vvvv != 1111B.

PMOVZX—Packed Move with Zero Extend


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 38 30 /r PMOVZXBW xmm1, xmm2/m64 | RM | V/V | SSE4_1 | Zero extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 16-bit integers in xmm1.
66 0F 38 31 /r PMOVZXBD xmm1, xmm2/m32 | RM | V/V | SSE4_1 | Zero extend 4 packed 8-bit integers in the low 4 bytes of xmm2/m32 to 4 packed 32-bit integers in xmm1.
66 0F 38 32 /r PMOVZXBQ xmm1, xmm2/m16 | RM | V/V | SSE4_1 | Zero extend 2 packed 8-bit integers in the low 2 bytes of xmm2/m16 to 2 packed 64-bit integers in xmm1.
66 0F 38 33 /r PMOVZXWD xmm1, xmm2/m64 | RM | V/V | SSE4_1 | Zero extend 4 packed 16-bit integers in the low 8 bytes of xmm2/m64 to 4 packed 32-bit integers in xmm1.
66 0F 38 34 /r PMOVZXWQ xmm1, xmm2/m32 | RM | V/V | SSE4_1 | Zero extend 2 packed 16-bit integers in the low 4 bytes of xmm2/m32 to 2 packed 64-bit integers in xmm1.
66 0F 38 35 /r PMOVZXDQ xmm1, xmm2/m64 | RM | V/V | SSE4_1 | Zero extend 2 packed 32-bit integers in the low 8 bytes of xmm2/m64 to 2 packed 64-bit integers in xmm1.
VEX.128.66.0F38.WIG 30 /r VPMOVZXBW xmm1, xmm2/m64 | RM | V/V | AVX | Zero extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 16-bit integers in xmm1.
VEX.128.66.0F38.WIG 31 /r VPMOVZXBD xmm1, xmm2/m32 | RM | V/V | AVX | Zero extend 4 packed 8-bit integers in the low 4 bytes of xmm2/m32 to 4 packed 32-bit integers in xmm1.
VEX.128.66.0F38.WIG 32 /r VPMOVZXBQ xmm1, xmm2/m16 | RM | V/V | AVX | Zero extend 2 packed 8-bit integers in the low 2 bytes of xmm2/m16 to 2 packed 64-bit integers in xmm1.
VEX.128.66.0F38.WIG 33 /r VPMOVZXWD xmm1, xmm2/m64 | RM | V/V | AVX | Zero extend 4 packed 16-bit integers in the low 8 bytes of xmm2/m64 to 4 packed 32-bit integers in xmm1.
VEX.128.66.0F38.WIG 34 /r VPMOVZXWQ xmm1, xmm2/m32 | RM | V/V | AVX | Zero extend 2 packed 16-bit integers in the low 4 bytes of xmm2/m32 to 2 packed 64-bit integers in xmm1.
VEX.128.66.0F38.WIG 35 /r VPMOVZXDQ xmm1, xmm2/m64 | RM | V/V | AVX | Zero extend 2 packed 32-bit integers in the low 8 bytes of xmm2/m64 to 2 packed 64-bit integers in xmm1.
VEX.256.66.0F38.WIG 30 /r VPMOVZXBW ymm1, xmm2/m128 | RM | V/V | AVX2 | Zero extend 16 packed 8-bit integers in xmm2/m128 to 16 packed 16-bit integers in ymm1.
VEX.256.66.0F38.WIG 31 /r VPMOVZXBD ymm1, xmm2/m64 | RM | V/V | AVX2 | Zero extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 32-bit integers in ymm1.
VEX.256.66.0F38.WIG 32 /r VPMOVZXBQ ymm1, xmm2/m32 | RM | V/V | AVX2 | Zero extend 4 packed 8-bit integers in the low 4 bytes of xmm2/m32 to 4 packed 64-bit integers in ymm1.
VEX.256.66.0F38.WIG 33 /r VPMOVZXWD ymm1, xmm2/m128 | RM | V/V | AVX2 | Zero extend 8 packed 16-bit integers in xmm2/m128 to 8 packed 32-bit integers in ymm1.
VEX.256.66.0F38.WIG 34 /r VPMOVZXWQ ymm1, xmm2/m64 | RM | V/V | AVX2 | Zero extend 4 packed 16-bit integers in the low 8 bytes of xmm2/m64 to 4 packed 64-bit integers in ymm1.
VEX.256.66.0F38.WIG 35 /r VPMOVZXDQ ymm1, xmm2/m128 | RM | V/V | AVX2 | Zero extend 4 packed 32-bit integers in xmm2/m128 to 4 packed 64-bit integers in ymm1.
EVEX.128.66.0F38.WIG 30 /r VPMOVZXBW xmm1 {k1}{z}, xmm2/m64 | HVM | V/V | AVX512VL AVX512BW | Zero extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 16-bit integers in xmm1.
EVEX.256.66.0F38.WIG 30 /r VPMOVZXBW ymm1 {k1}{z}, xmm2/m128 | HVM | V/V | AVX512VL AVX512BW | Zero extend 16 packed 8-bit integers in xmm2/m128 to 16 packed 16-bit integers in ymm1.
EVEX.512.66.0F38.WIG 30 /r VPMOVZXBW zmm1 {k1}{z}, ymm2/m256 | HVM | V/V | AVX512BW | Zero extend 32 packed 8-bit integers in ymm2/m256 to 32 packed 16-bit integers in zmm1.
EVEX.128.66.0F38.WIG 31 /r VPMOVZXBD xmm1 {k1}{z}, xmm2/m32 | QVM | V/V | AVX512VL AVX512F | Zero extend 4 packed 8-bit integers in the low 4 bytes of xmm2/m32 to 4 packed 32-bit integers in xmm1 subject to writemask k1.
EVEX.256.66.0F38.WIG 31 /r VPMOVZXBD ymm1 {k1}{z}, xmm2/m64 | QVM | V/V | AVX512VL AVX512F | Zero extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 32-bit integers in ymm1 subject to writemask k1.
EVEX.512.66.0F38.WIG 31 /r VPMOVZXBD zmm1 {k1}{z}, xmm2/m128 | QVM | V/V | AVX512F | Zero extend 16 packed 8-bit integers in xmm2/m128 to 16 packed 32-bit integers in zmm1 subject to writemask k1.
EVEX.128.66.0F38.WIG 32 /r VPMOVZXBQ xmm1 {k1}{z}, xmm2/m16 | OVM | V/V | AVX512VL AVX512F | Zero extend 2 packed 8-bit integers in the low 2 bytes of xmm2/m16 to 2 packed 64-bit integers in xmm1 subject to writemask k1.
EVEX.256.66.0F38.WIG 32 /r VPMOVZXBQ ymm1 {k1}{z}, xmm2/m32 | OVM | V/V | AVX512VL AVX512F | Zero extend 4 packed 8-bit integers in the low 4 bytes of xmm2/m32 to 4 packed 64-bit integers in ymm1 subject to writemask k1.
EVEX.512.66.0F38.WIG 32 /r VPMOVZXBQ zmm1 {k1}{z}, xmm2/m64 | OVM | V/V | AVX512F | Zero extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 64-bit integers in zmm1 subject to writemask k1.
EVEX.128.66.0F38.WIG 33 /r VPMOVZXWD xmm1 {k1}{z}, xmm2/m64 | HVM | V/V | AVX512VL AVX512F | Zero extend 4 packed 16-bit integers in the low 8 bytes of xmm2/m64 to 4 packed 32-bit integers in xmm1 subject to writemask k1.
EVEX.256.66.0F38.WIG 33 /r VPMOVZXWD ymm1 {k1}{z}, xmm2/m128 | HVM | V/V | AVX512VL AVX512F | Zero extend 8 packed 16-bit integers in xmm2/m128 to 8 packed 32-bit integers in ymm1 subject to writemask k1.
EVEX.512.66.0F38.WIG 33 /r VPMOVZXWD zmm1 {k1}{z}, ymm2/m256 | HVM | V/V | AVX512F | Zero extend 16 packed 16-bit integers in ymm2/m256 to 16 packed 32-bit integers in zmm1 subject to writemask k1.
EVEX.128.66.0F38.WIG 34 /r VPMOVZXWQ xmm1 {k1}{z}, xmm2/m32 | QVM | V/V | AVX512VL AVX512F | Zero extend 2 packed 16-bit integers in the low 4 bytes of xmm2/m32 to 2 packed 64-bit integers in xmm1 subject to writemask k1.
EVEX.256.66.0F38.WIG 34 /r VPMOVZXWQ ymm1 {k1}{z}, xmm2/m64 | QVM | V/V | AVX512VL AVX512F | Zero extend 4 packed 16-bit integers in the low 8 bytes of xmm2/m64 to 4 packed 64-bit integers in ymm1 subject to writemask k1.
EVEX.512.66.0F38.WIG 34 /r VPMOVZXWQ zmm1 {k1}{z}, xmm2/m128 | QVM | V/V | AVX512F | Zero extend 8 packed 16-bit integers in xmm2/m128 to 8 packed 64-bit integers in zmm1 subject to writemask k1.
EVEX.128.66.0F38.W0 35 /r VPMOVZXDQ xmm1 {k1}{z}, xmm2/m64 | HVM | V/V | AVX512VL AVX512F | Zero extend 2 packed 32-bit integers in the low 8 bytes of xmm2/m64 to 2 packed 64-bit integers in xmm1 using writemask k1.
EVEX.256.66.0F38.W0 35 /r VPMOVZXDQ ymm1 {k1}{z}, xmm2/m128 | HVM | V/V | AVX512VL AVX512F | Zero extend 4 packed 32-bit integers in xmm2/m128 to 4 packed 64-bit integers in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 35 /r VPMOVZXDQ zmm1 {k1}{z}, ymm2/m256 | HVM | V/V | AVX512F | Zero extend 8 packed 32-bit integers in ymm2/m256 to 8 packed 64-bit integers in zmm1 using writemask k1.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
HVM, QVM, OVM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA


Description 

Legacy, VEX and EVEX encoded versions: Packed byte, word, or dword integers starting from the low bytes of the
source operand (second operand) are zero extended to word, dword, or quadword integers and stored in the
destination operand.

128-bit Legacy SSE version: Bits (MAX_VL-1:128) of the corresponding destination register remain unchanged.

VEX.128 encoded version: Bits (MAX_VL-1:128) of the corresponding destination register are zeroed.

VEX.256 encoded version: Bits (MAX_VL-1:256) of the corresponding destination register are zeroed.

EVEX encoded versions: Packed byte, word, or dword integers starting from the low bytes of the source operand
(second operand) are zero extended to word, dword, or quadword integers and stored to the destination operand
under the writemask. The destination register is an XMM, YMM or ZMM register.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise the instruction will #UD.
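
As an illustration of the zero-extension behavior described above, the following scalar C sketch models the 128-bit byte-to-doubleword form (PMOVZXBD). The function name and array interface are hypothetical and exist only for this example; the sketch assumes the source is presented as 16 bytes, of which only the low 4 are consumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of PMOVZXBD (128-bit form): zero extend
   the low 4 bytes of a 16-byte source to 4 unsigned 32-bit integers.
   Bytes src[4..15] are never read. */
void pmovzxbd_model(uint32_t dst[4], const uint8_t src[16])
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < 4; i++) {
        /* Unsigned widening leaves the upper 24 bits of each result 0. */
        dst[i] = (uint32_t)src[i];
    }
}
```

In contrast to the sign-extending PMOVSX forms, a source byte of FFH yields 000000FFH here rather than FFFFFFFFH.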

Operation 

Packed_Zero_Extend_BYTE_to_WORD(DEST, SRC)
DEST[15:0] ← ZeroExtend(SRC[7:0]);
DEST[31:16] ← ZeroExtend(SRC[15:8]);
DEST[47:32] ← ZeroExtend(SRC[23:16]);
DEST[63:48] ← ZeroExtend(SRC[31:24]);
DEST[79:64] ← ZeroExtend(SRC[39:32]);
DEST[95:80] ← ZeroExtend(SRC[47:40]);
DEST[111:96] ← ZeroExtend(SRC[55:48]);
DEST[127:112] ← ZeroExtend(SRC[63:56]);

Packed_Zero_Extend_BYTE_to_DWORD(DEST, SRC)
DEST[31:0] ← ZeroExtend(SRC[7:0]);
DEST[63:32] ← ZeroExtend(SRC[15:8]);
DEST[95:64] ← ZeroExtend(SRC[23:16]);
DEST[127:96] ← ZeroExtend(SRC[31:24]);

Packed_Zero_Extend_BYTE_to_QWORD(DEST, SRC)
DEST[63:0] ← ZeroExtend(SRC[7:0]);
DEST[127:64] ← ZeroExtend(SRC[15:8]);

Packed_Zero_Extend_WORD_to_DWORD(DEST, SRC)
DEST[31:0] ← ZeroExtend(SRC[15:0]);
DEST[63:32] ← ZeroExtend(SRC[31:16]);
DEST[95:64] ← ZeroExtend(SRC[47:32]);
DEST[127:96] ← ZeroExtend(SRC[63:48]);

Packed_Zero_Extend_WORD_to_QWORD(DEST, SRC)
DEST[63:0] ← ZeroExtend(SRC[15:0]);
DEST[127:64] ← ZeroExtend(SRC[31:16]);

Packed_Zero_Extend_DWORD_to_QWORD(DEST, SRC)
DEST[63:0] ← ZeroExtend(SRC[31:0]);
DEST[127:64] ← ZeroExtend(SRC[63:32]);

VPMOVZXBW (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)

Packed_Zero_Extend_BYTE_to_WORD(TMP_DEST[127:0], SRC[63:0])
IF VL >= 256
    Packed_Zero_Extend_BYTE_to_WORD(TMP_DEST[255:128], SRC[127:64])
FI;
IF VL >= 512
    Packed_Zero_Extend_BYTE_to_WORD(TMP_DEST[383:256], SRC[191:128])
    Packed_Zero_Extend_BYTE_to_WORD(TMP_DEST[511:384], SRC[255:192])
FI;
FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← TMP_DEST[i+15:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+15:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
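
The difference between merging-masking and zeroing-masking in the loop above can be sketched in scalar C. This is a simplified model of the VL = 128 (KL = 8) case only; the function name, the `zeroing` flag (standing in for the {z} modifier), and the array interface are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of VPMOVZXBW xmm1 {k1}{z}, xmm2/m64 with KL = 8.
   dest holds the prior destination contents on entry. */
void vpmovzxbw_128_model(uint16_t dest[8], const uint8_t src[8],
                         uint8_t k1, bool zeroing)
{
    assert(dest != NULL && src != NULL);
    for (int j = 0; j < 8; j++) {
        if ((k1 >> j) & 1) {
            dest[j] = (uint16_t)src[j];   /* mask bit set: write the result */
        } else if (zeroing) {
            dest[j] = 0;                  /* zeroing-masking */
        }
        /* merging-masking: dest[j] remains unchanged */
    }
}
```

With k1 = 81H, only elements 0 and 7 receive new values; the remaining elements either keep their old contents (merging) or become 0 (zeroing).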

VPMOVZXBD (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)

Packed_Zero_Extend_BYTE_to_DWORD(TMP_DEST[127:0], SRC[31:0])
IF VL >= 256
    Packed_Zero_Extend_BYTE_to_DWORD(TMP_DEST[255:128], SRC[63:32])
FI;
IF VL >= 512
    Packed_Zero_Extend_BYTE_to_DWORD(TMP_DEST[383:256], SRC[95:64])
    Packed_Zero_Extend_BYTE_to_DWORD(TMP_DEST[511:384], SRC[127:96])
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VPMOVZXBQ (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)

Packed_Zero_Extend_BYTE_to_QWORD(TMP_DEST[127:0], SRC[15:0])
IF VL >= 256
    Packed_Zero_Extend_BYTE_to_QWORD(TMP_DEST[255:128], SRC[31:16])
FI;
IF VL >= 512
    Packed_Zero_Extend_BYTE_to_QWORD(TMP_DEST[383:256], SRC[47:32])
    Packed_Zero_Extend_BYTE_to_QWORD(TMP_DEST[511:384], SRC[63:48])
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VPMOVZXWD (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)

Packed_Zero_Extend_WORD_to_DWORD(TMP_DEST[127:0], SRC[63:0])
IF VL >= 256
    Packed_Zero_Extend_WORD_to_DWORD(TMP_DEST[255:128], SRC[127:64])
FI;
IF VL >= 512
    Packed_Zero_Extend_WORD_to_DWORD(TMP_DEST[383:256], SRC[191:128])
    Packed_Zero_Extend_WORD_to_DWORD(TMP_DEST[511:384], SRC[255:192])
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VPMOVZXWQ (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)

Packed_Zero_Extend_WORD_to_QWORD(TMP_DEST[127:0], SRC[31:0])
IF VL >= 256
    Packed_Zero_Extend_WORD_to_QWORD(TMP_DEST[255:128], SRC[63:32])
FI;
IF VL >= 512
    Packed_Zero_Extend_WORD_to_QWORD(TMP_DEST[383:256], SRC[95:64])
    Packed_Zero_Extend_WORD_to_QWORD(TMP_DEST[511:384], SRC[127:96])
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VPMOVZXDQ (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)

Packed_Zero_Extend_DWORD_to_QWORD(TEMP_DEST[127:0], SRC[63:0])
IF VL >= 256
    Packed_Zero_Extend_DWORD_to_QWORD(TEMP_DEST[255:128], SRC[127:64])
FI;
IF VL >= 512
    Packed_Zero_Extend_DWORD_to_QWORD(TEMP_DEST[383:256], SRC[191:128])
    Packed_Zero_Extend_DWORD_to_QWORD(TEMP_DEST[511:384], SRC[255:192])
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TEMP_DEST[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VPMOVZXBW (VEX.256 encoded version)

Packed_Zero_Extend_BYTE_to_WORD(DEST[127:0], SRC[63:0])
Packed_Zero_Extend_BYTE_to_WORD(DEST[255:128], SRC[127:64])
DEST[MAX_VL-1:256] ← 0

VPMOVZXBD (VEX.256 encoded version)

Packed_Zero_Extend_BYTE_to_DWORD(DEST[127:0], SRC[31:0])
Packed_Zero_Extend_BYTE_to_DWORD(DEST[255:128], SRC[63:32])
DEST[MAX_VL-1:256] ← 0

VPMOVZXBQ (VEX.256 encoded version)

Packed_Zero_Extend_BYTE_to_QWORD(DEST[127:0], SRC[15:0])
Packed_Zero_Extend_BYTE_to_QWORD(DEST[255:128], SRC[31:16])
DEST[MAX_VL-1:256] ← 0

VPMOVZXWD (VEX.256 encoded version)

Packed_Zero_Extend_WORD_to_DWORD(DEST[127:0], SRC[63:0])
Packed_Zero_Extend_WORD_to_DWORD(DEST[255:128], SRC[127:64])
DEST[MAX_VL-1:256] ← 0

VPMOVZXWQ (VEX.256 encoded version)

Packed_Zero_Extend_WORD_to_QWORD(DEST[127:0], SRC[31:0])
Packed_Zero_Extend_WORD_to_QWORD(DEST[255:128], SRC[63:32])
DEST[MAX_VL-1:256] ← 0

VPMOVZXDQ (VEX.256 encoded version)

Packed_Zero_Extend_DWORD_to_QWORD(DEST[127:0], SRC[63:0])
Packed_Zero_Extend_DWORD_to_QWORD(DEST[255:128], SRC[127:64])
DEST[MAX_VL-1:256] ← 0

VPMOVZXBW (VEX.128 encoded version)

Packed_Zero_Extend_BYTE_to_WORD()
DEST[MAX_VL-1:128] ← 0

VPMOVZXBD (VEX.128 encoded version)

Packed_Zero_Extend_BYTE_to_DWORD()
DEST[MAX_VL-1:128] ← 0

VPMOVZXBQ (VEX.128 encoded version)

Packed_Zero_Extend_BYTE_to_QWORD()
DEST[MAX_VL-1:128] ← 0

VPMOVZXWD (VEX.128 encoded version)

Packed_Zero_Extend_WORD_to_DWORD()
DEST[MAX_VL-1:128] ← 0

VPMOVZXWQ (VEX.128 encoded version)

Packed_Zero_Extend_WORD_to_QWORD()
DEST[MAX_VL-1:128] ← 0

VPMOVZXDQ (VEX.128 encoded version)

Packed_Zero_Extend_DWORD_to_QWORD()
DEST[MAX_VL-1:128] ← 0

PMOVZXBW

Packed_Zero_Extend_BYTE_to_WORD()
DEST[MAX_VL-1:128] (Unmodified)

PMOVZXBD

Packed_Zero_Extend_BYTE_to_DWORD()
DEST[MAX_VL-1:128] (Unmodified)

PMOVZXBQ

Packed_Zero_Extend_BYTE_to_QWORD()
DEST[MAX_VL-1:128] (Unmodified)

PMOVZXWD

Packed_Zero_Extend_WORD_to_DWORD()
DEST[MAX_VL-1:128] (Unmodified)

PMOVZXWQ

Packed_Zero_Extend_WORD_to_QWORD()
DEST[MAX_VL-1:128] (Unmodified)

PMOVZXDQ

Packed_Zero_Extend_DWORD_to_QWORD()
DEST[MAX_VL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent 

VPMOVZXBW __m512i _mm512_cvtepu8_epi16(__m256i a);
VPMOVZXBW __m512i _mm512_mask_cvtepu8_epi16(__m512i a, __mmask32 k, __m256i b);
VPMOVZXBW __m512i _mm512_maskz_cvtepu8_epi16(__mmask32 k, __m256i b);
VPMOVZXBD __m512i _mm512_cvtepu8_epi32(__m128i a);
VPMOVZXBD __m512i _mm512_mask_cvtepu8_epi32(__m512i a, __mmask16 k, __m128i b);
VPMOVZXBD __m512i _mm512_maskz_cvtepu8_epi32(__mmask16 k, __m128i b);
VPMOVZXBQ __m512i _mm512_cvtepu8_epi64(__m128i a);
VPMOVZXBQ __m512i _mm512_mask_cvtepu8_epi64(__m512i a, __mmask8 k, __m128i b);
VPMOVZXBQ __m512i _mm512_maskz_cvtepu8_epi64(__mmask8 k, __m128i a);
VPMOVZXDQ __m512i _mm512_cvtepu32_epi64(__m256i a);
VPMOVZXDQ __m512i _mm512_mask_cvtepu32_epi64(__m512i a, __mmask8 k, __m256i b);
VPMOVZXDQ __m512i _mm512_maskz_cvtepu32_epi64(__mmask8 k, __m256i a);
VPMOVZXWD __m512i _mm512_cvtepu16_epi32(__m128i a);
VPMOVZXWD __m512i _mm512_mask_cvtepu16_epi32(__m512i a, __mmask16 k, __m128i b);
VPMOVZXWD __m512i _mm512_maskz_cvtepu16_epi32(__mmask16 k, __m128i a);
VPMOVZXWQ __m512i _mm512_cvtepu16_epi64(__m256i a);
VPMOVZXWQ __m512i _mm512_mask_cvtepu16_epi64(__m512i a, __mmask8 k, __m256i b);
VPMOVZXWQ __m512i _mm512_maskz_cvtepu16_epi64(__mmask8 k, __m256i a);
VPMOVZXBW __m256i _mm256_cvtepu8_epi16(__m256i a);
VPMOVZXBW __m256i _mm256_mask_cvtepu8_epi16(__m256i a, __mmask16 k, __m128i b);
VPMOVZXBW __m256i _mm256_maskz_cvtepu8_epi16(__mmask16 k, __m128i b);
VPMOVZXBD __m256i _mm256_cvtepu8_epi32(__m128i a);
VPMOVZXBD __m256i _mm256_mask_cvtepu8_epi32(__m256i a, __mmask8 k, __m128i b);
VPMOVZXBD __m256i _mm256_maskz_cvtepu8_epi32(__mmask8 k, __m128i b);
VPMOVZXBQ __m256i _mm256_cvtepu8_epi64(__m128i a);
VPMOVZXBQ __m256i _mm256_mask_cvtepu8_epi64(__m256i a, __mmask8 k, __m128i b);
VPMOVZXBQ __m256i _mm256_maskz_cvtepu8_epi64(__mmask8 k, __m128i a);
VPMOVZXDQ __m256i _mm256_cvtepu32_epi64(__m128i a);
VPMOVZXDQ __m256i _mm256_mask_cvtepu32_epi64(__m256i a, __mmask8 k, __m128i b);
VPMOVZXDQ __m256i _mm256_maskz_cvtepu32_epi64(__mmask8 k, __m128i a);
VPMOVZXWD __m256i _mm256_cvtepu16_epi32(__m128i a);
VPMOVZXWD __m256i _mm256_mask_cvtepu16_epi32(__m256i a, __mmask16 k, __m128i b);
VPMOVZXWD __m256i _mm256_maskz_cvtepu16_epi32(__mmask16 k, __m128i a);
VPMOVZXWQ __m256i _mm256_cvtepu16_epi64(__m128i a);
VPMOVZXWQ __m256i _mm256_mask_cvtepu16_epi64(__m256i a, __mmask8 k, __m128i b);
VPMOVZXWQ __m256i _mm256_maskz_cvtepu16_epi64(__mmask8 k, __m128i a);
VPMOVZXBW __m128i _mm_mask_cvtepu8_epi16(__m128i a, __mmask8 k, __m128i b);
VPMOVZXBW __m128i _mm_maskz_cvtepu8_epi16(__mmask8 k, __m128i b);
VPMOVZXBD __m128i _mm_mask_cvtepu8_epi32(__m128i a, __mmask8 k, __m128i b);
VPMOVZXBD __m128i _mm_maskz_cvtepu8_epi32(__mmask8 k, __m128i b);
VPMOVZXBQ __m128i _mm_mask_cvtepu8_epi64(__m128i a, __mmask8 k, __m128i b);
VPMOVZXBQ __m128i _mm_maskz_cvtepu8_epi64(__mmask8 k, __m128i a);
VPMOVZXDQ __m128i _mm_mask_cvtepu32_epi64(__m128i a, __mmask8 k, __m128i b);
VPMOVZXDQ __m128i _mm_maskz_cvtepu32_epi64(__mmask8 k, __m128i a);
VPMOVZXWD __m128i _mm_mask_cvtepu16_epi32(__m128i a, __mmask16 k, __m128i b);
VPMOVZXWD __m128i _mm_maskz_cvtepu16_epi32(__mmask8 k, __m128i a);
VPMOVZXWQ __m128i _mm_mask_cvtepu16_epi64(__m128i a, __mmask8 k, __m128i b);
VPMOVZXWQ __m128i _mm_maskz_cvtepu16_epi64(__mmask8 k, __m128i a);
PMOVZXBW __m128i _mm_cvtepu8_epi16 (__m128i a);
PMOVZXBD __m128i _mm_cvtepu8_epi32 (__m128i a);
PMOVZXBQ __m128i _mm_cvtepu8_epi64 (__m128i a);
PMOVZXWD __m128i _mm_cvtepu16_epi32 (__m128i a);
PMOVZXWQ __m128i _mm_cvtepu16_epi64 (__m128i a);
PMOVZXDQ __m128i _mm_cvtepu32_epi64 (__m128i a);

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 5. 

EVEX-encoded instruction, see Exceptions Type E5. 

#UD If VEX.vvvv != 1111B, or EVEX.vvvv != 1111B.

PMULDQ—Multiply Packed Doubleword Integers


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 38 28 /r PMULDQ xmm1, xmm2/m128 | RM | V/V | SSE4_1 | Multiply packed signed doubleword integers in xmm1 by packed signed doubleword integers in xmm2/m128, and store the quadword results in xmm1.
VEX.NDS.128.66.0F38.WIG 28 /r VPMULDQ xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Multiply packed signed doubleword integers in xmm2 by packed signed doubleword integers in xmm3/m128, and store the quadword results in xmm1.
VEX.NDS.256.66.0F38.WIG 28 /r VPMULDQ ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Multiply packed signed doubleword integers in ymm2 by packed signed doubleword integers in ymm3/m256, and store the quadword results in ymm1.
EVEX.NDS.128.66.0F38.W1 28 /r VPMULDQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Multiply packed signed doubleword integers in xmm2 by packed signed doubleword integers in xmm3/m128/m64bcst, and store the quadword results in xmm1 using writemask k1.
EVEX.NDS.256.66.0F38.W1 28 /r VPMULDQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Multiply packed signed doubleword integers in ymm2 by packed signed doubleword integers in ymm3/m256/m64bcst, and store the quadword results in ymm1 using writemask k1.
EVEX.NDS.512.66.0F38.W1 28 /r VPMULDQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | FV | V/V | AVX512F | Multiply packed signed doubleword integers in zmm2 by packed signed doubleword integers in zmm3/m512/m64bcst, and store the quadword results in zmm1 using writemask k1.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Multiplies packed signed doubleword integers in the even-numbered (zero-based reference) elements of the first 
source operand with the packed signed doubleword integers in the corresponding elements of the second source 
operand and stores packed signed quadword results in the destination operand. 

128-bit Legacy SSE version: The input signed doubleword integers are taken from the even-numbered elements of
the source operands, i.e., the first (low) and third doubleword elements. For 128-bit memory operands, 128 bits
are fetched from memory, but only the first and third doublewords are used in the computation. The first source
operand and the destination XMM operand are the same. The second source operand can be an XMM register or a
128-bit memory location. Bits (MAX_VL-1:128) of the corresponding destination register remain unchanged.

VEX.128 encoded version: The input signed doubleword integers are taken from the even-numbered elements of
the source operands, i.e., the first (low) and third doubleword elements. For 128-bit memory operands, 128 bits
are fetched from memory, but only the first and third doublewords are used in the computation. The first source
operand and the destination operand are XMM registers. The second source operand can be an XMM register or a
128-bit memory location. Bits (MAX_VL-1:128) of the corresponding destination register are zeroed.

VEX.256 encoded version: The input signed doubleword integers are taken from the even-numbered elements of
the source operands, i.e., the first, third, fifth, and seventh doubleword elements. For 256-bit memory operands,
256 bits are fetched from memory, but only the four even-numbered doublewords are used in the computation. The
first source operand and the destination operand are YMM registers. The second source operand can be a YMM
register or a 256-bit memory location. Bits (MAX_VL-1:256) of the corresponding destination ZMM register are
zeroed.

EVEX encoded version: The input signed doubleword integers are taken from the even-numbered elements of the
source operands. The first source operand is a ZMM/YMM/XMM register. The second source operand can be a
ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcast from a 64-bit
memory location. The destination is a ZMM/YMM/XMM register, and is updated according to the writemask at
64-bit granularity.

Operation 

VPMULDQ (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1) AND (SRC2 *is memory*)
                THEN DEST[i+63:i] ← SignExtend64(SRC1[i+31:i]) * SignExtend64(SRC2[31:0])
                ELSE DEST[i+63:i] ← SignExtend64(SRC1[i+31:i]) * SignExtend64(SRC2[i+31:i])
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VPMULDQ (VEX.256 encoded version)

DEST[63:0] ← SignExtend64(SRC1[31:0]) * SignExtend64(SRC2[31:0])
DEST[127:64] ← SignExtend64(SRC1[95:64]) * SignExtend64(SRC2[95:64])
DEST[191:128] ← SignExtend64(SRC1[159:128]) * SignExtend64(SRC2[159:128])
DEST[255:192] ← SignExtend64(SRC1[223:192]) * SignExtend64(SRC2[223:192])
DEST[MAX_VL-1:256] ← 0

VPMULDQ (VEX.128 encoded version)

DEST[63:0] ← SignExtend64(SRC1[31:0]) * SignExtend64(SRC2[31:0])
DEST[127:64] ← SignExtend64(SRC1[95:64]) * SignExtend64(SRC2[95:64])
DEST[MAX_VL-1:128] ← 0

PMULDQ (128-bit Legacy SSE version)

DEST[63:0] ← SignExtend64(DEST[31:0]) * SignExtend64(SRC[31:0])
DEST[127:64] ← SignExtend64(DEST[95:64]) * SignExtend64(SRC[95:64])
DEST[MAX_VL-1:128] (Unmodified)
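
The 128-bit operation above can be modeled in scalar C as follows. This is a minimal sketch for illustration; the function name and the representation of each operand as an array of four doublewords are assumptions, not part of the instruction definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the 128-bit (V)PMULDQ operation.
   Only the even-numbered doublewords (elements 0 and 2) of each
   source take part; elements 1 and 3 are ignored. */
void pmuldq_model(int64_t dest[2], const int32_t src1[4], const int32_t src2[4])
{
    assert(dest != NULL && src1 != NULL && src2 != NULL);
    /* Widen to 64 bits before multiplying so the full signed
       product is kept (SignExtend64 in the pseudocode). */
    dest[0] = (int64_t)src1[0] * (int64_t)src2[0];
    dest[1] = (int64_t)src1[2] * (int64_t)src2[2];
}
```

Because both factors are widened first, the product of two 32-bit values always fits in the 64-bit result, so no overflow or saturation handling is needed in this model.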

Intel C/C++ Compiler Intrinsic Equivalent 

VPMULDQ __m512i _mm512_mul_epi32(__m512i a, __m512i b);
VPMULDQ __m512i _mm512_mask_mul_epi32(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPMULDQ __m512i _mm512_maskz_mul_epi32(__mmask8 k, __m512i a, __m512i b);
VPMULDQ __m256i _mm256_mask_mul_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPMULDQ __m256i _mm256_maskz_mul_epi32(__mmask8 k, __m256i a, __m256i b);
VPMULDQ __m128i _mm_mask_mul_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMULDQ __m128i _mm_maskz_mul_epi32(__mmask8 k, __m128i a, __m128i b);
(V)PMULDQ __m128i _mm_mul_epi32(__m128i a, __m128i b);
VPMULDQ __m256i _mm256_mul_epi32(__m256i a, __m256i b);

SIMD Floating-Point Exceptions

None

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4.
EVEX-encoded instruction, see Exceptions Type E4.

PMULHRSW — Packed Multiply High with Round and Scale


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F 38 0B /r¹ PMULHRSW mm1, mm2/m64 | RM | V/V | SSSE3 | Multiply 16-bit signed words, scale and round signed doublewords, pack high 16 bits to mm1.
66 0F 38 0B /r PMULHRSW xmm1, xmm2/m128 | RM | V/V | SSSE3 | Multiply 16-bit signed words, scale and round signed doublewords, pack high 16 bits to xmm1.
VEX.NDS.128.66.0F38.WIG 0B /r VPMULHRSW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Multiply 16-bit signed words, scale and round signed doublewords, pack high 16 bits to xmm1.
VEX.NDS.256.66.0F38.WIG 0B /r VPMULHRSW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Multiply 16-bit signed words, scale and round signed doublewords, pack high 16 bits to ymm1.
EVEX.NDS.128.66.0F38.WIG 0B /r VPMULHRSW xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Multiply 16-bit signed words, scale and round signed doublewords, pack high 16 bits to xmm1 under writemask k1.
EVEX.NDS.256.66.0F38.WIG 0B /r VPMULHRSW ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Multiply 16-bit signed words, scale and round signed doublewords, pack high 16 bits to ymm1 under writemask k1.
EVEX.NDS.512.66.0F38.WIG 0B /r VPMULHRSW zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Multiply 16-bit signed words, scale and round signed doublewords, pack high 16 bits to zmm1 under writemask k1.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers"
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

PMULHRSW multiplies vertically each signed 16-bit integer from the destination operand (first operand) with the 
corresponding signed 16-bit integer of the source operand (second operand), producing intermediate, signed 32- 
bit integers. Each intermediate 32-bit integer is truncated to the 18 most significant bits. Rounding is always 
performed by adding 1 to the least significant bit of the 18-bit intermediate result. The final result is obtained by 
selecting the 16 bits immediately to the right of the most significant bit of each 18-bit intermediate result and
packing them into the destination operand.
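
The truncate, round, and select steps described above can be modelled one lane at a time. The following is a minimal scalar sketch in C, not reference code from this manual. The function name mulhrs_lane is hypothetical, and the sketch assumes that the compiler implements right shift of a negative signed value as an arithmetic shift and that an out-of-range conversion to int16_t wraps, as mainstream x86 compilers do.

```c
#include <stdint.h>

/* Scalar model of a single PMULHRSW lane (hypothetical helper).
 * Assumes arithmetic right shift for negative signed values. */
int16_t mulhrs_lane(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b; /* signed 32-bit intermediate */
    int32_t temp = (product >> 14) + 1;        /* keep the 18 MSBs, add the rounding bit */
    return (int16_t)(temp >> 1);               /* select bits 16:1 of temp */
}
```

Read as Q15 fixed-point values, the lane computes a rounded a * b / 2^15, so 16384 * 16384 (0.5 * 0.5) yields 8192 (0.25).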

When the source operand is a 128-bit memory operand, the operand must be aligned on a 16-byte boundary or a 
general-protection exception (#GP) will be generated. 

In 64-bit mode and not encoded with VEX/EVEX, use the REX prefix to access XMM8-XMM15 registers. 

Legacy SSE version 64-bit operand: Both operands can be MMX registers. The second source operand is an MMX
register or a 64-bit memory location.

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the corresponding YMM destination
register remain unchanged.

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the destination YMM register are
zeroed.

VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM
register conditionally updated with writemask k1.

Operation 

PMULHRSW (with 64-bit operands)

temp0[31:0] = INT32 ((DEST[15:0] * SRC[15:0]) >> 14) + 1;
temp1[31:0] = INT32 ((DEST[31:16] * SRC[31:16]) >> 14) + 1;
temp2[31:0] = INT32 ((DEST[47:32] * SRC[47:32]) >> 14) + 1;
temp3[31:0] = INT32 ((DEST[63:48] * SRC[63:48]) >> 14) + 1;
DEST[15:0] = temp0[16:1];
DEST[31:16] = temp1[16:1];
DEST[47:32] = temp2[16:1];
DEST[63:48] = temp3[16:1];

PMULHRSW (with 128-bit operands)

temp0[31:0] = INT32 ((DEST[15:0] * SRC[15:0]) >> 14) + 1;
temp1[31:0] = INT32 ((DEST[31:16] * SRC[31:16]) >> 14) + 1;
temp2[31:0] = INT32 ((DEST[47:32] * SRC[47:32]) >> 14) + 1;
temp3[31:0] = INT32 ((DEST[63:48] * SRC[63:48]) >> 14) + 1;
temp4[31:0] = INT32 ((DEST[79:64] * SRC[79:64]) >> 14) + 1;
temp5[31:0] = INT32 ((DEST[95:80] * SRC[95:80]) >> 14) + 1;
temp6[31:0] = INT32 ((DEST[111:96] * SRC[111:96]) >> 14) + 1;
temp7[31:0] = INT32 ((DEST[127:112] * SRC[127:112]) >> 14) + 1;
DEST[15:0] = temp0[16:1];
DEST[31:16] = temp1[16:1];
DEST[47:32] = temp2[16:1];
DEST[63:48] = temp3[16:1];
DEST[79:64] = temp4[16:1];
DEST[95:80] = temp5[16:1];
DEST[111:96] = temp6[16:1];
DEST[127:112] = temp7[16:1];

VPMULHRSW (VEX.128 encoded version)

temp0[31:0] ← INT32 ((SRC1[15:0] * SRC2[15:0]) >> 14) + 1
temp1[31:0] ← INT32 ((SRC1[31:16] * SRC2[31:16]) >> 14) + 1
temp2[31:0] ← INT32 ((SRC1[47:32] * SRC2[47:32]) >> 14) + 1
temp3[31:0] ← INT32 ((SRC1[63:48] * SRC2[63:48]) >> 14) + 1
temp4[31:0] ← INT32 ((SRC1[79:64] * SRC2[79:64]) >> 14) + 1
temp5[31:0] ← INT32 ((SRC1[95:80] * SRC2[95:80]) >> 14) + 1
temp6[31:0] ← INT32 ((SRC1[111:96] * SRC2[111:96]) >> 14) + 1
temp7[31:0] ← INT32 ((SRC1[127:112] * SRC2[127:112]) >> 14) + 1
DEST[15:0] ← temp0[16:1]
DEST[31:16] ← temp1[16:1]
DEST[47:32] ← temp2[16:1]
DEST[63:48] ← temp3[16:1]
DEST[79:64] ← temp4[16:1]
DEST[95:80] ← temp5[16:1]
DEST[111:96] ← temp6[16:1]
DEST[127:112] ← temp7[16:1]
DEST[VLMAX-1:128] ← 0

VPMULHRSW (VEX.256 encoded version)

temp0[31:0] ← INT32 ((SRC1[15:0] * SRC2[15:0]) >> 14) + 1
temp1[31:0] ← INT32 ((SRC1[31:16] * SRC2[31:16]) >> 14) + 1
temp2[31:0] ← INT32 ((SRC1[47:32] * SRC2[47:32]) >> 14) + 1
temp3[31:0] ← INT32 ((SRC1[63:48] * SRC2[63:48]) >> 14) + 1
temp4[31:0] ← INT32 ((SRC1[79:64] * SRC2[79:64]) >> 14) + 1
temp5[31:0] ← INT32 ((SRC1[95:80] * SRC2[95:80]) >> 14) + 1
temp6[31:0] ← INT32 ((SRC1[111:96] * SRC2[111:96]) >> 14) + 1
temp7[31:0] ← INT32 ((SRC1[127:112] * SRC2[127:112]) >> 14) + 1
temp8[31:0] ← INT32 ((SRC1[143:128] * SRC2[143:128]) >> 14) + 1
temp9[31:0] ← INT32 ((SRC1[159:144] * SRC2[159:144]) >> 14) + 1
temp10[31:0] ← INT32 ((SRC1[175:160] * SRC2[175:160]) >> 14) + 1
temp11[31:0] ← INT32 ((SRC1[191:176] * SRC2[191:176]) >> 14) + 1
temp12[31:0] ← INT32 ((SRC1[207:192] * SRC2[207:192]) >> 14) + 1
temp13[31:0] ← INT32 ((SRC1[223:208] * SRC2[223:208]) >> 14) + 1
temp14[31:0] ← INT32 ((SRC1[239:224] * SRC2[239:224]) >> 14) + 1
temp15[31:0] ← INT32 ((SRC1[255:240] * SRC2[255:240]) >> 14) + 1
DEST[15:0] ← temp0[16:1]
DEST[31:16] ← temp1[16:1]
DEST[47:32] ← temp2[16:1]
DEST[63:48] ← temp3[16:1]
DEST[79:64] ← temp4[16:1]
DEST[95:80] ← temp5[16:1]
DEST[111:96] ← temp6[16:1]
DEST[127:112] ← temp7[16:1]
DEST[143:128] ← temp8[16:1]
DEST[159:144] ← temp9[16:1]
DEST[175:160] ← temp10[16:1]
DEST[191:176] ← temp11[16:1]
DEST[207:192] ← temp12[16:1]
DEST[223:208] ← temp13[16:1]
DEST[239:224] ← temp14[16:1]
DEST[255:240] ← temp15[16:1]
DEST[MAX_VL-1:256] ← 0


VPMULHRSW (EVEX encoded version)

(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN
            temp[31:0] ← ((SRC1[i+15:i] * SRC2[i+15:i]) >> 14) + 1
            DEST[i+15:i] ← temp[16:1]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+15:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
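
The writemask handling in the EVEX loop above can be sketched in scalar C. This is an illustrative model of the 128-bit form only (KL = 8), not reference code: the function name mulhrs_masked8, the array-based operands, and the zeroing flag are assumptions made for the example, and the sketch relies on arithmetic right shift of negative signed values.

```c
#include <stdint.h>

/* Hypothetical model of VPMULHRSW xmm1 {k1}{z}, xmm2, xmm3 (KL = 8).
 * zeroing != 0 selects zeroing-masking; zeroing == 0 selects merging-masking. */
void mulhrs_masked8(int16_t dest[8], const int16_t src1[8],
                    const int16_t src2[8], uint8_t k1, int zeroing)
{
    for (int j = 0; j < 8; j++) {
        if ((k1 >> j) & 1) {
            /* Mask bit set: compute the rounded, scaled product for lane j. */
            int32_t temp = (((int32_t)src1[j] * (int32_t)src2[j]) >> 14) + 1;
            dest[j] = (int16_t)(temp >> 1);
        } else if (zeroing) {
            dest[j] = 0;            /* zeroing-masking */
        }
        /* merging-masking: dest[j] is left unchanged */
    }
}
```

Lanes whose mask bit is clear either keep their previous contents or are cleared, depending on the masking mode; only lanes whose mask bit is set receive a new product.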

Intel C/C++ Compiler Intrinsic Equivalents 

VPMULHRSW __m512i _mm512_mulhrs_epi16(__m512i a, __m512i b);

VPMULHRSW __m512i _mm512_mask_mulhrs_epi16(__m512i s, __mmask32 k, __m512i a, __m512i b);

VPMULHRSW __m512i _mm512_maskz_mulhrs_epi16(__mmask32 k, __m512i a, __m512i b);

VPMULHRSW __m256i _mm256_mask_mulhrs_epi16(__m256i s, __mmask16 k, __m256i a, __m256i b);

VPMULHRSW __m256i _mm256_maskz_mulhrs_epi16(__mmask16 k, __m256i a, __m256i b);

VPMULHRSW __m128i _mm_mask_mulhrs_epi16(__m128i s, __mmask8 k, __m128i a, __m128i b);

VPMULHRSW __m128i _mm_maskz_mulhrs_epi16(__mmask8 k, __m128i a, __m128i b);

PMULHRSW: __m64 _mm_mulhrs_pi16 (__m64 a, __m64 b)

(V)PMULHRSW: __m128i _mm_mulhrs_epi16 (__m128i a, __m128i b)

VPMULHRSW: __m256i _mm256_mulhrs_epi16 (__m256i a, __m256i b)

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded instruction, see Exceptions Type E4.nb.

PMULHUW—Multiply Packed Unsigned Integers and Store High Result

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F E4 /r¹ PMULHUW mm1, mm2/m64 | RM | V/V | SSE | Multiply the packed unsigned word integers in mm1 register and mm2/m64, and store the high 16 bits of the results in mm1.
66 0F E4 /r PMULHUW xmm1, xmm2/m128 | RM | V/V | SSE2 | Multiply the packed unsigned word integers in xmm1 and xmm2/m128, and store the high 16 bits of the results in xmm1.
VEX.NDS.128.66.0F.WIG E4 /r VPMULHUW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Multiply the packed unsigned word integers in xmm2 and xmm3/m128, and store the high 16 bits of the results in xmm1.
VEX.NDS.256.66.0F.WIG E4 /r VPMULHUW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Multiply the packed unsigned word integers in ymm2 and ymm3/m256, and store the high 16 bits of the results in ymm1.
EVEX.NDS.128.66.0F.WIG E4 /r VPMULHUW xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Multiply the packed unsigned word integers in xmm2 and xmm3/m128, and store the high 16 bits of the results in xmm1 under writemask k1.
EVEX.NDS.256.66.0F.WIG E4 /r VPMULHUW ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Multiply the packed unsigned word integers in ymm2 and ymm3/m256, and store the high 16 bits of the results in ymm1 under writemask k1.
EVEX.NDS.512.66.0F.WIG E4 /r VPMULHUW zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Multiply the packed unsigned word integers in zmm2 and zmm3/m512, and store the high 16 bits of the results in zmm1 under writemask k1.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers"
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Performs a SIMD unsigned multiply of the packed unsigned word integers in the destination operand (first operand)
and the source operand (second operand), and stores the high 16 bits of each 32-bit intermediate result in the
destination operand. (Figure 4-12 shows this operation when using 64-bit operands.)
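
The per-element behavior described above can be written as a short scalar model. This is a minimal sketch in C for illustration only; the function name mulhi_u16 is hypothetical and is not one of the compiler intrinsics.

```c
#include <stdint.h>

/* Scalar model of a single PMULHUW lane (hypothetical helper). */
uint16_t mulhi_u16(uint16_t a, uint16_t b)
{
    uint32_t temp = (uint32_t)a * (uint32_t)b; /* unsigned 32-bit intermediate */
    return (uint16_t)(temp >> 16);             /* high 16 bits, TEMP[31:16] */
}
```

For example, 0xFFFF * 0xFFFF is 0xFFFE0001, so the word stored in the destination is 0xFFFE.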

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15). 

Legacy SSE version 64-bit operand: The source operand can be an MMX technology register or a 64-bit memory 
location. The destination operand is an MMX technology register. 

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the corresponding YMM destination
register remain unchanged.

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the destination YMM register are
zeroed. VEX.L must be 0, otherwise the instruction will #UD.

VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM
register conditionally updated with writemask k1.


SRC:   X3 | X2 | X1 | X0
DEST:  Y3 | Y2 | Y1 | Y0
TEMP:  Z3 = X3 * Y3 | Z2 = X2 * Y2 | Z1 = X1 * Y1 | Z0 = X0 * Y0
DEST:  Z3[31:16] | Z2[31:16] | Z1[31:16] | Z0[31:16]

Figure 4-12. PMULHUW and PMULHW Instruction Operation Using 64-bit Operands


Operation 


PMULHUW (with 64-bit operands)

TEMP0[31:0] ← DEST[15:0] * SRC[15:0]; (* Unsigned multiplication *)
TEMP1[31:0] ← DEST[31:16] * SRC[31:16];
TEMP2[31:0] ← DEST[47:32] * SRC[47:32];
TEMP3[31:0] ← DEST[63:48] * SRC[63:48];
DEST[15:0] ← TEMP0[31:16];
DEST[31:16] ← TEMP1[31:16];
DEST[47:32] ← TEMP2[31:16];
DEST[63:48] ← TEMP3[31:16];


PMULHUW (with 128-bit operands)

TEMP0[31:0] ← DEST[15:0] * SRC[15:0]; (* Unsigned multiplication *)
TEMP1[31:0] ← DEST[31:16] * SRC[31:16];
TEMP2[31:0] ← DEST[47:32] * SRC[47:32];
TEMP3[31:0] ← DEST[63:48] * SRC[63:48];
TEMP4[31:0] ← DEST[79:64] * SRC[79:64];
TEMP5[31:0] ← DEST[95:80] * SRC[95:80];
TEMP6[31:0] ← DEST[111:96] * SRC[111:96];
TEMP7[31:0] ← DEST[127:112] * SRC[127:112];
DEST[15:0] ← TEMP0[31:16];
DEST[31:16] ← TEMP1[31:16];
DEST[47:32] ← TEMP2[31:16];
DEST[63:48] ← TEMP3[31:16];
DEST[79:64] ← TEMP4[31:16];
DEST[95:80] ← TEMP5[31:16];
DEST[111:96] ← TEMP6[31:16];
DEST[127:112] ← TEMP7[31:16];


VPMULHUW (VEX.128 encoded version)

TEMP0[31:0] ← SRC1[15:0] * SRC2[15:0]
TEMP1[31:0] ← SRC1[31:16] * SRC2[31:16]
TEMP2[31:0] ← SRC1[47:32] * SRC2[47:32]
TEMP3[31:0] ← SRC1[63:48] * SRC2[63:48]
TEMP4[31:0] ← SRC1[79:64] * SRC2[79:64]
TEMP5[31:0] ← SRC1[95:80] * SRC2[95:80]
TEMP6[31:0] ← SRC1[111:96] * SRC2[111:96]
TEMP7[31:0] ← SRC1[127:112] * SRC2[127:112]
DEST[15:0] ← TEMP0[31:16]
DEST[31:16] ← TEMP1[31:16]
DEST[47:32] ← TEMP2[31:16]
DEST[63:48] ← TEMP3[31:16]
DEST[79:64] ← TEMP4[31:16]
DEST[95:80] ← TEMP5[31:16]
DEST[111:96] ← TEMP6[31:16]
DEST[127:112] ← TEMP7[31:16]
DEST[VLMAX-1:128] ← 0

VPMULHUW (VEX.256 encoded version)

TEMP0[31:0] ← SRC1[15:0] * SRC2[15:0]
TEMP1[31:0] ← SRC1[31:16] * SRC2[31:16]
TEMP2[31:0] ← SRC1[47:32] * SRC2[47:32]
TEMP3[31:0] ← SRC1[63:48] * SRC2[63:48]
TEMP4[31:0] ← SRC1[79:64] * SRC2[79:64]
TEMP5[31:0] ← SRC1[95:80] * SRC2[95:80]
TEMP6[31:0] ← SRC1[111:96] * SRC2[111:96]
TEMP7[31:0] ← SRC1[127:112] * SRC2[127:112]
TEMP8[31:0] ← SRC1[143:128] * SRC2[143:128]
TEMP9[31:0] ← SRC1[159:144] * SRC2[159:144]
TEMP10[31:0] ← SRC1[175:160] * SRC2[175:160]
TEMP11[31:0] ← SRC1[191:176] * SRC2[191:176]
TEMP12[31:0] ← SRC1[207:192] * SRC2[207:192]
TEMP13[31:0] ← SRC1[223:208] * SRC2[223:208]
TEMP14[31:0] ← SRC1[239:224] * SRC2[239:224]
TEMP15[31:0] ← SRC1[255:240] * SRC2[255:240]
DEST[15:0] ← TEMP0[31:16]
DEST[31:16] ← TEMP1[31:16]
DEST[47:32] ← TEMP2[31:16]
DEST[63:48] ← TEMP3[31:16]
DEST[79:64] ← TEMP4[31:16]
DEST[95:80] ← TEMP5[31:16]
DEST[111:96] ← TEMP6[31:16]
DEST[127:112] ← TEMP7[31:16]
DEST[143:128] ← TEMP8[31:16]
DEST[159:144] ← TEMP9[31:16]
DEST[175:160] ← TEMP10[31:16]
DEST[191:176] ← TEMP11[31:16]
DEST[207:192] ← TEMP12[31:16]
DEST[223:208] ← TEMP13[31:16]
DEST[239:224] ← TEMP14[31:16]
DEST[255:240] ← TEMP15[31:16]
DEST[MAX_VL-1:256] ← 0


VPMULHUW (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN
            temp[31:0] ← SRC1[i+15:i] * SRC2[i+15:i]
            DEST[i+15:i] ← temp[31:16]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+15:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VPMULHUW __m512i _mm512_mulhi_epu16(__m512i a, __m512i b);

VPMULHUW __m512i _mm512_mask_mulhi_epu16(__m512i s, __mmask32 k, __m512i a, __m512i b);

VPMULHUW __m512i _mm512_maskz_mulhi_epu16(__mmask32 k, __m512i a, __m512i b);

VPMULHUW __m256i _mm256_mask_mulhi_epu16(__m256i s, __mmask16 k, __m256i a, __m256i b);

VPMULHUW __m256i _mm256_maskz_mulhi_epu16(__mmask16 k, __m256i a, __m256i b);

VPMULHUW __m128i _mm_mask_mulhi_epu16(__m128i s, __mmask8 k, __m128i a, __m128i b);

VPMULHUW __m128i _mm_maskz_mulhi_epu16(__mmask8 k, __m128i a, __m128i b);

PMULHUW: __m64 _mm_mulhi_pu16(__m64 a, __m64 b)

(V)PMULHUW: __m128i _mm_mulhi_epu16 (__m128i a, __m128i b)

VPMULHUW: __m256i _mm256_mulhi_epu16 (__m256i a, __m256i b)

Flags Affected 

None. 

Numeric Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded instruction, see Exceptions Type E4.nb.

PMULHW—Multiply Packed Signed Integers and Store High Result

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F E5 /r¹ PMULHW mm, mm/m64 | RM | V/V | MMX | Multiply the packed signed word integers in mm1 register and mm2/m64, and store the high 16 bits of the results in mm1.
66 0F E5 /r PMULHW xmm1, xmm2/m128 | RM | V/V | SSE2 | Multiply the packed signed word integers in xmm1 and xmm2/m128, and store the high 16 bits of the results in xmm1.
VEX.NDS.128.66.0F.WIG E5 /r VPMULHW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Multiply the packed signed word integers in xmm2 and xmm3/m128, and store the high 16 bits of the results in xmm1.
VEX.NDS.256.66.0F.WIG E5 /r VPMULHW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Multiply the packed signed word integers in ymm2 and ymm3/m256, and store the high 16 bits of the results in ymm1.
EVEX.NDS.128.66.0F.WIG E5 /r VPMULHW xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Multiply the packed signed word integers in xmm2 and xmm3/m128, and store the high 16 bits of the results in xmm1 under writemask k1.
EVEX.NDS.256.66.0F.WIG E5 /r VPMULHW ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Multiply the packed signed word integers in ymm2 and ymm3/m256, and store the high 16 bits of the results in ymm1 under writemask k1.
EVEX.NDS.512.66.0F.WIG E5 /r VPMULHW zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Multiply the packed signed word integers in zmm2 and zmm3/m512, and store the high 16 bits of the results in zmm1 under writemask k1.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers"
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Performs a SIMD signed multiply of the packed signed word integers in the destination operand (first operand) and
the source operand (second operand), and stores the high 16 bits of each intermediate 32-bit result in the
destination operand. (Figure 4-12 shows this operation when using 64-bit operands.)
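
The signed variant differs from PMULHUW only in how the 16-bit inputs are interpreted before the multiply. The following scalar sketch in C illustrates one lane. The function name mulhi_s16 is hypothetical, and the sketch assumes arithmetic right shift for negative signed values, which mainstream x86 compilers provide.

```c
#include <stdint.h>

/* Scalar model of a single PMULHW lane (hypothetical helper).
 * Assumes arithmetic right shift for negative signed values. */
int16_t mulhi_s16(int16_t a, int16_t b)
{
    int32_t temp = (int32_t)a * (int32_t)b; /* signed 32-bit intermediate */
    return (int16_t)(temp >> 16);           /* high 16 bits, TEMP[31:16] */
}
```

With signed operands, -1 * -1 is 1 and the stored high word is 0; the same bit patterns (0xFFFF and 0xFFFF) treated as unsigned words would instead produce 0xFFFE.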

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15). 

Legacy SSE version 64-bit operand: The source operand can be an MMX technology register or a 64-bit memory 
location. The destination operand is an MMX technology register. 

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the corresponding YMM destination
register remain unchanged.

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the destination YMM register are
zeroed. VEX.L must be 0, otherwise the instruction will #UD.

VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM
register conditionally updated with writemask k1.


Operation 


PMULHW (with 64-bit operands)

TEMP0[31:0] ← DEST[15:0] * SRC[15:0]; (* Signed multiplication *)
TEMP1[31:0] ← DEST[31:16] * SRC[31:16];
TEMP2[31:0] ← DEST[47:32] * SRC[47:32];
TEMP3[31:0] ← DEST[63:48] * SRC[63:48];
DEST[15:0] ← TEMP0[31:16];
DEST[31:16] ← TEMP1[31:16];
DEST[47:32] ← TEMP2[31:16];
DEST[63:48] ← TEMP3[31:16];


PMULHW (with 128-bit operands)

TEMP0[31:0] ← DEST[15:0] * SRC[15:0]; (* Signed multiplication *)
TEMP1[31:0] ← DEST[31:16] * SRC[31:16];
TEMP2[31:0] ← DEST[47:32] * SRC[47:32];
TEMP3[31:0] ← DEST[63:48] * SRC[63:48];
TEMP4[31:0] ← DEST[79:64] * SRC[79:64];
TEMP5[31:0] ← DEST[95:80] * SRC[95:80];
TEMP6[31:0] ← DEST[111:96] * SRC[111:96];
TEMP7[31:0] ← DEST[127:112] * SRC[127:112];
DEST[15:0] ← TEMP0[31:16];
DEST[31:16] ← TEMP1[31:16];
DEST[47:32] ← TEMP2[31:16];
DEST[63:48] ← TEMP3[31:16];
DEST[79:64] ← TEMP4[31:16];
DEST[95:80] ← TEMP5[31:16];
DEST[111:96] ← TEMP6[31:16];
DEST[127:112] ← TEMP7[31:16];


VPMULHW (VEX.128 encoded version)

TEMP0[31:0] ← SRC1[15:0] * SRC2[15:0] (* Signed multiplication *)
TEMP1[31:0] ← SRC1[31:16] * SRC2[31:16]
TEMP2[31:0] ← SRC1[47:32] * SRC2[47:32]
TEMP3[31:0] ← SRC1[63:48] * SRC2[63:48]
TEMP4[31:0] ← SRC1[79:64] * SRC2[79:64]
TEMP5[31:0] ← SRC1[95:80] * SRC2[95:80]
TEMP6[31:0] ← SRC1[111:96] * SRC2[111:96]
TEMP7[31:0] ← SRC1[127:112] * SRC2[127:112]
DEST[15:0] ← TEMP0[31:16]
DEST[31:16] ← TEMP1[31:16]
DEST[47:32] ← TEMP2[31:16]
DEST[63:48] ← TEMP3[31:16]
DEST[79:64] ← TEMP4[31:16]
DEST[95:80] ← TEMP5[31:16]
DEST[111:96] ← TEMP6[31:16]
DEST[127:112] ← TEMP7[31:16]
DEST[VLMAX-1:128] ← 0

VPMULHW (VEX.256 encoded version)

TEMP0[31:0] ← SRC1[15:0] * SRC2[15:0] (* Signed multiplication *)
TEMP1[31:0] ← SRC1[31:16] * SRC2[31:16]
TEMP2[31:0] ← SRC1[47:32] * SRC2[47:32]
TEMP3[31:0] ← SRC1[63:48] * SRC2[63:48]
TEMP4[31:0] ← SRC1[79:64] * SRC2[79:64]
TEMP5[31:0] ← SRC1[95:80] * SRC2[95:80]
TEMP6[31:0] ← SRC1[111:96] * SRC2[111:96]
TEMP7[31:0] ← SRC1[127:112] * SRC2[127:112]
TEMP8[31:0] ← SRC1[143:128] * SRC2[143:128]
TEMP9[31:0] ← SRC1[159:144] * SRC2[159:144]
TEMP10[31:0] ← SRC1[175:160] * SRC2[175:160]
TEMP11[31:0] ← SRC1[191:176] * SRC2[191:176]
TEMP12[31:0] ← SRC1[207:192] * SRC2[207:192]
TEMP13[31:0] ← SRC1[223:208] * SRC2[223:208]
TEMP14[31:0] ← SRC1[239:224] * SRC2[239:224]
TEMP15[31:0] ← SRC1[255:240] * SRC2[255:240]
DEST[15:0] ← TEMP0[31:16]
DEST[31:16] ← TEMP1[31:16]
DEST[47:32] ← TEMP2[31:16]
DEST[63:48] ← TEMP3[31:16]
DEST[79:64] ← TEMP4[31:16]
DEST[95:80] ← TEMP5[31:16]
DEST[111:96] ← TEMP6[31:16]
DEST[127:112] ← TEMP7[31:16]
DEST[143:128] ← TEMP8[31:16]
DEST[159:144] ← TEMP9[31:16]
DEST[175:160] ← TEMP10[31:16]
DEST[191:176] ← TEMP11[31:16]
DEST[207:192] ← TEMP12[31:16]
DEST[223:208] ← TEMP13[31:16]
DEST[239:224] ← TEMP14[31:16]
DEST[255:240] ← TEMP15[31:16]
DEST[VLMAX-1:256] ← 0


VPMULHW (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN
            temp[31:0] ← SRC1[i+15:i] * SRC2[i+15:i]
            DEST[i+15:i] ← temp[31:16]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+15:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VPMULHW __m512i _mm512_mulhi_epi16(__m512i a, __m512i b);

VPMULHW __m512i _mm512_mask_mulhi_epi16(__m512i s, __mmask32 k, __m512i a, __m512i b);

VPMULHW __m512i _mm512_maskz_mulhi_epi16(__mmask32 k, __m512i a, __m512i b);

VPMULHW __m256i _mm256_mask_mulhi_epi16(__m256i s, __mmask16 k, __m256i a, __m256i b);

VPMULHW __m256i _mm256_maskz_mulhi_epi16(__mmask16 k, __m256i a, __m256i b);

VPMULHW __m128i _mm_mask_mulhi_epi16(__m128i s, __mmask8 k, __m128i a, __m128i b);

VPMULHW __m128i _mm_maskz_mulhi_epi16(__mmask8 k, __m128i a, __m128i b);

PMULHW: __m64 _mm_mulhi_pi16 (__m64 m1, __m64 m2)

(V)PMULHW: __m128i _mm_mulhi_epi16 (__m128i a, __m128i b)

VPMULHW: __m256i _mm256_mulhi_epi16 (__m256i a, __m256i b)

Flags Affected 

None. 

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded instruction, see Exceptions Type E4.nb.

PMULLD/PMULLQ—Multiply Packed Integers and Store Low Result

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 38 40 /r PMULLD xmm1, xmm2/m128 | RM | V/V | SSE4_1 | Multiply the packed dword signed integers in xmm1 and xmm2/m128 and store the low 32 bits of each product in xmm1.
VEX.NDS.128.66.0F38.WIG 40 /r VPMULLD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Multiply the packed dword signed integers in xmm2 and xmm3/m128 and store the low 32 bits of each product in xmm1.
VEX.NDS.256.66.0F38.WIG 40 /r VPMULLD ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Multiply the packed dword signed integers in ymm2 and ymm3/m256 and store the low 32 bits of each product in ymm1.
EVEX.NDS.128.66.0F38.W0 40 /r VPMULLD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply the packed dword signed integers in xmm2 and xmm3/m128/m32bcst and store the low 32 bits of each product in xmm1 under writemask k1.
EVEX.NDS.256.66.0F38.W0 40 /r VPMULLD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Multiply the packed dword signed integers in ymm2 and ymm3/m256/m32bcst and store the low 32 bits of each product in ymm1 under writemask k1.
EVEX.NDS.512.66.0F38.W0 40 /r VPMULLD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | FV | V/V | AVX512F | Multiply the packed dword signed integers in zmm2 and zmm3/m512/m32bcst and store the low 32 bits of each product in zmm1 under writemask k1.
EVEX.NDS.128.66.0F38.W1 40 /r VPMULLQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512DQ | Multiply the packed qword signed integers in xmm2 and xmm3/m128/m64bcst and store the low 64 bits of each product in xmm1 under writemask k1.
EVEX.NDS.256.66.0F38.W1 40 /r VPMULLQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512DQ | Multiply the packed qword signed integers in ymm2 and ymm3/m256/m64bcst and store the low 64 bits of each product in ymm1 under writemask k1.
EVEX.NDS.512.66.0F38.W1 40 /r VPMULLQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | FV | V/V | AVX512DQ | Multiply the packed qword signed integers in zmm2 and zmm3/m512/m64bcst and store the low 64 bits of each product in zmm1 under writemask k1.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Performs a SIMD signed multiply of the packed signed dword/qword integers from each element of the first source
operand with the corresponding element in the second source operand. The low 32/64 bits of each 64/128-bit
intermediate result are stored to the destination operand.
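
For the dword form, the "widen, multiply, keep the low half" behavior described above can be modelled per element. The following is a minimal scalar sketch in C; the function name mullo_d_lane is hypothetical, and the result is returned as a raw 32-bit pattern so that the example avoids implementation-defined signed conversions.

```c
#include <stdint.h>

/* Scalar model of a single PMULLD lane (hypothetical helper).
 * Returns the low 32 bits of the signed 64-bit product as a bit pattern. */
uint32_t mullo_d_lane(int32_t a, int32_t b)
{
    int64_t temp = (int64_t)a * (int64_t)b; /* signed 64-bit intermediate, Temp[63:0] */
    return (uint32_t)(uint64_t)temp;        /* low 32 bits, Temp[31:0] */
}
```

Because only the low half of the product is kept, the stored bits are the same whether the inputs are viewed as signed or unsigned; for example, -2 * 3 stores 0xFFFFFFFA, and 65536 * 65536 stores 0 because the product overflows 32 bits.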

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding ZMM destination
register remain unchanged.

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding ZMM register
are zeroed.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. Bits (MAX_VL-1:256) of the corresponding destination ZMM register are zeroed.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand is a
ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a
32/64-bit memory location. The destination operand is conditionally updated based on writemask k1.

Operation 

VPMULLQ (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b == 1) AND (SRC2 *is memory*)
            THEN Temp[127:0] ← SRC1[i+63:i] * SRC2[63:0]
            ELSE Temp[127:0] ← SRC1[i+63:i] * SRC2[i+63:i]
        FI;
        DEST[i+63:i] ← Temp[63:0]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0
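
The embedded-broadcast branch of the loop above (EVEX.b set with a memory source) can be illustrated with a scalar sketch in C. The function name mullq_lanes, the array-based operands, and the broadcast flag are assumptions made for this example, and write masking is deliberately left out to keep the focus on operand selection.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the VPMULLQ element loop without write masking.
 * broadcast != 0 mimics EVEX.b = 1 with a memory operand: every lane
 * multiplies by the single qword at src2[0]. */
void mullq_lanes(uint64_t *dest, const uint64_t *src1, const uint64_t *src2,
                 size_t kl, int broadcast)
{
    for (size_t j = 0; j < kl; j++) {
        uint64_t rhs = broadcast ? src2[0] : src2[j];
        /* Unsigned 64-bit multiplication wraps modulo 2^64, which equals
         * the low 64 bits (Temp[63:0]) of the 128-bit product. */
        dest[j] = src1[j] * rhs;
    }
}
```

Calling the helper with kl set to 2, 4, or 8 corresponds to the 128-bit, 256-bit, and 512-bit vector lengths listed in the (KL, VL) pairs.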

VPMULLD (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b == 1) AND (SRC2 *is memory*)
            THEN Temp[63:0] ← SRC1[i+31:i] * SRC2[31:0]
            ELSE Temp[63:0] ← SRC1[i+31:i] * SRC2[i+31:i]
        FI;
        DEST[i+31:i] ← Temp[31:0]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+31:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VPMULLD (VEX.256 encoded version)

Temp0[63:0] ← SRC1[31:0] * SRC2[31:0]
Temp1[63:0] ← SRC1[63:32] * SRC2[63:32]
Temp2[63:0] ← SRC1[95:64] * SRC2[95:64]
Temp3[63:0] ← SRC1[127:96] * SRC2[127:96]
Temp4[63:0] ← SRC1[159:128] * SRC2[159:128]
Temp5[63:0] ← SRC1[191:160] * SRC2[191:160]
Temp6[63:0] ← SRC1[223:192] * SRC2[223:192]
Temp7[63:0] ← SRC1[255:224] * SRC2[255:224]
DEST[31:0] ← Temp0[31:0]
DEST[63:32] ← Temp1[31:0]
DEST[95:64] ← Temp2[31:0]
DEST[127:96] ← Temp3[31:0]
DEST[159:128] ← Temp4[31:0]
DEST[191:160] ← Temp5[31:0]
DEST[223:192] ← Temp6[31:0]
DEST[255:224] ← Temp7[31:0]
DEST[MAX_VL-1:256] ← 0


VPMULLD (VEX.128 encoded version)

Temp0[63:0] ← SRC1[31:0] * SRC2[31:0]
Temp1[63:0] ← SRC1[63:32] * SRC2[63:32]
Temp2[63:0] ← SRC1[95:64] * SRC2[95:64]
Temp3[63:0] ← SRC1[127:96] * SRC2[127:96]
DEST[31:0] ← Temp0[31:0]
DEST[63:32] ← Temp1[31:0]
DEST[95:64] ← Temp2[31:0]
DEST[127:96] ← Temp3[31:0]
DEST[MAX_VL-1:128] ← 0


PMULLD (128-bit Legacy SSE version)

Temp0[63:0] ← DEST[31:0] * SRC[31:0]
Temp1[63:0] ← DEST[63:32] * SRC[63:32]
Temp2[63:0] ← DEST[95:64] * SRC[95:64]
Temp3[63:0] ← DEST[127:96] * SRC[127:96]
DEST[31:0] ← Temp0[31:0]
DEST[63:32] ← Temp1[31:0]
DEST[95:64] ← Temp2[31:0]
DEST[127:96] ← Temp3[31:0]
DEST[MAX_VL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VPMULLD __m512i _mm512_mullo_epi32(__m512i a, __m512i b);

VPMULLD __m512i _mm512_mask_mullo_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b);

VPMULLD __m512i _mm512_maskz_mullo_epi32(__mmask16 k, __m512i a, __m512i b);

VPMULLD __m256i _mm256_mask_mullo_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b);

VPMULLD __m256i _mm256_maskz_mullo_epi32(__mmask8 k, __m256i a, __m256i b);

VPMULLD __m128i _mm_mask_mullo_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b);

VPMULLD __m128i _mm_maskz_mullo_epi32(__mmask8 k, __m128i a, __m128i b);

VPMULLD __m256i _mm256_mullo_epi32(__m256i a, __m256i b);

PMULLD __m128i _mm_mullo_epi32(__m128i a, __m128i b);

VPMULLQ __m512i _mm512_mullo_epi64(__m512i a, __m512i b);

VPMULLQ __m512i _mm512_mask_mullo_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b);

VPMULLQ __m512i _mm512_maskz_mullo_epi64(__mmask8 k, __m512i a, __m512i b);

VPMULLQ __m256i _mm256_mullo_epi64(__m256i a, __m256i b);

VPMULLQ __m256i _mm256_mask_mullo_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b);

VPMULLQ __m256i _mm256_maskz_mullo_epi64(__mmask8 k, __m256i a, __m256i b);

VPMULLQ __m128i _mm_mullo_epi64(__m128i a, __m128i b);

VPMULLQ __m128i _mm_mask_mullo_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b);

VPMULLQ __m128i _mm_maskz_mullo_epi64(__mmask8 k, __m128i a, __m128i b);

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded instruction, see Exceptions Type E4.

PMULLW—Multiply Packed Signed Integers and Store Low Result


Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
--- | --- | --- | --- | --- | ---
0F D5 /r¹ | PMULLW mm, mm/m64 | RM | V/V | MMX | Multiply the packed signed word integers in mm1 register and mm2/m64, and store the low 16 bits of the results in mm1.
66 0F D5 /r | PMULLW xmm1, xmm2/m128 | RM | V/V | SSE2 | Multiply the packed signed word integers in xmm1 and xmm2/m128, and store the low 16 bits of the results in xmm1.
VEX.NDS.128.66.0F.WIG D5 /r | VPMULLW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Multiply the packed signed word integers in xmm2 and xmm3/m128, and store the low 16 bits of each product in xmm1.
VEX.NDS.256.66.0F.WIG D5 /r | VPMULLW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Multiply the packed signed word integers in ymm2 and ymm3/m256, and store the low 16 bits of the results in ymm1.
EVEX.NDS.128.66.0F.WIG D5 /r | VPMULLW xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Multiply the packed signed word integers in xmm2 and xmm3/m128, and store the low 16 bits of the results in xmm1 under writemask k1.
EVEX.NDS.256.66.0F.WIG D5 /r | VPMULLW ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Multiply the packed signed word integers in ymm2 and ymm3/m256, and store the low 16 bits of the results in ymm1 under writemask k1.
EVEX.NDS.512.66.0F.WIG D5 /r | VPMULLW zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Multiply the packed signed word integers in zmm2 and zmm3/m512, and store the low 16 bits of the results in zmm1 under writemask k1.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers"
in the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
--- | --- | --- | --- | ---
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Performs a SIMD signed multiply of the packed signed word integers in the destination operand (first operand) and
the source operand (second operand), and stores the low 16 bits of each intermediate 32-bit result in the destination
operand. (Figure 4-13 shows this operation when using 64-bit operands.)

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15). 

Legacy SSE version 64-bit operand: The source operand can be an MMX technology register or a 64-bit memory 
location. The destination operand is an MMX technology register. 

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the corresponding YMM destination
register remain unchanged.

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the destination YMM register are
zeroed. VEX.L must be 0, otherwise the instruction will #UD.

VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand is a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is conditionally updated
based on writemask k1.




[Figure: each word Xi of SRC (X3, X2, X1, X0) is multiplied by the corresponding word Yi of DEST (Y3, Y2, Y1, Y0)
to form TEMP values Zi = Xi * Yi; DEST then receives Z3[15:0], Z2[15:0], Z1[15:0], Z0[15:0].]

Figure 4-13. PMULLW Instruction Operation Using 64-bit Operands


Operation 


PMULLW (with 64-bit operands)

TEMP0[31:0] <- DEST[15:0] * SRC[15:0]; (* Signed multiplication *)
TEMP1[31:0] <- DEST[31:16] * SRC[31:16];
TEMP2[31:0] <- DEST[47:32] * SRC[47:32];
TEMP3[31:0] <- DEST[63:48] * SRC[63:48];
DEST[15:0] <- TEMP0[15:0];
DEST[31:16] <- TEMP1[15:0];
DEST[47:32] <- TEMP2[15:0];
DEST[63:48] <- TEMP3[15:0];


PMULLW (with 128-bit operands)

TEMP0[31:0] <- DEST[15:0] * SRC[15:0]; (* Signed multiplication *)
TEMP1[31:0] <- DEST[31:16] * SRC[31:16];
TEMP2[31:0] <- DEST[47:32] * SRC[47:32];
TEMP3[31:0] <- DEST[63:48] * SRC[63:48];
TEMP4[31:0] <- DEST[79:64] * SRC[79:64];
TEMP5[31:0] <- DEST[95:80] * SRC[95:80];
TEMP6[31:0] <- DEST[111:96] * SRC[111:96];
TEMP7[31:0] <- DEST[127:112] * SRC[127:112];
DEST[15:0] <- TEMP0[15:0];
DEST[31:16] <- TEMP1[15:0];
DEST[47:32] <- TEMP2[15:0];
DEST[63:48] <- TEMP3[15:0];
DEST[79:64] <- TEMP4[15:0];
DEST[95:80] <- TEMP5[15:0];
DEST[111:96] <- TEMP6[15:0];
DEST[127:112] <- TEMP7[15:0];

DEST[VLMAX-1:256] <- 0

VPMULLW (VEX.128 encoded version)

Temp0[31:0] <- SRC1[15:0] * SRC2[15:0]
Temp1[31:0] <- SRC1[31:16] * SRC2[31:16]
Temp2[31:0] <- SRC1[47:32] * SRC2[47:32]
Temp3[31:0] <- SRC1[63:48] * SRC2[63:48]
Temp4[31:0] <- SRC1[79:64] * SRC2[79:64]
Temp5[31:0] <- SRC1[95:80] * SRC2[95:80]
Temp6[31:0] <- SRC1[111:96] * SRC2[111:96]
Temp7[31:0] <- SRC1[127:112] * SRC2[127:112]
DEST[15:0] <- Temp0[15:0]
DEST[31:16] <- Temp1[15:0]
DEST[47:32] <- Temp2[15:0]
DEST[63:48] <- Temp3[15:0]
DEST[79:64] <- Temp4[15:0]
DEST[95:80] <- Temp5[15:0]
DEST[111:96] <- Temp6[15:0]
DEST[127:112] <- Temp7[15:0]
DEST[VLMAX-1:128] <- 0

PMULLW (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j <- 0 TO KL-1
    i <- j * 16
    IF k1[j] OR *no writemask*
        THEN
            temp[31:0] <- SRC1[i+15:i] * SRC2[i+15:i]
            DEST[i+15:i] <- temp[15:0]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+15:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0
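
The writemask handling in the EVEX pseudocode above can be sketched as a scalar C loop. This is an illustrative model under stated assumptions: the function name `vpmullw_masked_model`, the `zeroing` flag, and the array-per-lane layout are hypothetical and are not part of any intrinsic API.

```c
#include <stdint.h>

/* Hypothetical scalar model of the EVEX form over kl word lanes.
   zeroing != 0 selects zeroing-masking; otherwise merging-masking applies. */
static void vpmullw_masked_model(uint16_t dest[], const int16_t src1[],
                                 const int16_t src2[], int kl,
                                 uint32_t k1, int zeroing)
{
    for (int j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u) {
            /* temp[31:0]: signed 32-bit product of the two word lanes. */
            int32_t temp = (int32_t)src1[j] * (int32_t)src2[j];
            dest[j] = (uint16_t)temp;   /* keep temp[15:0] */
        } else if (zeroing) {
            dest[j] = 0;                /* zeroing-masking */
        }
        /* merging-masking: dest[j] is left unchanged */
    }
}
```

Passing kl = 8, 16, or 32 corresponds to the 128-, 256-, and 512-bit vector lengths listed in the pseudocode.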

Intel C/C++ Compiler Intrinsic Equivalent

VPMULLW __m512i _mm512_mullo_epi16(__m512i a, __m512i b);
VPMULLW __m512i _mm512_mask_mullo_epi16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPMULLW __m512i _mm512_maskz_mullo_epi16(__mmask32 k, __m512i a, __m512i b);
VPMULLW __m256i _mm256_mask_mullo_epi16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPMULLW __m256i _mm256_maskz_mullo_epi16(__mmask16 k, __m256i a, __m256i b);
VPMULLW __m128i _mm_mask_mullo_epi16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMULLW __m128i _mm_maskz_mullo_epi16(__mmask8 k, __m128i a, __m128i b);
PMULLW: __m64 _mm_mullo_pi16(__m64 m1, __m64 m2)
(V)PMULLW: __m128i _mm_mullo_epi16(__m128i a, __m128i b)
VPMULLW: __m256i _mm256_mullo_epi16(__m256i a, __m256i b);

Flags Affected 

None.

SIMD Floating-Point Exceptions

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 
EVEX-encoded instruction, see Exceptions Type E4.nb.

PMULUDQ—Multiply Packed Unsigned Doubleword Integers


Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
--- | --- | --- | --- | --- | ---
0F F4 /r¹ | PMULUDQ mm1, mm2/m64 | RM | V/V | SSE2 | Multiply unsigned doubleword integer in mm1 by unsigned doubleword integer in mm2/m64, and store the quadword result in mm1.
66 0F F4 /r | PMULUDQ xmm1, xmm2/m128 | RM | V/V | SSE2 | Multiply packed unsigned doubleword integers in xmm1 by packed unsigned doubleword integers in xmm2/m128, and store the quadword results in xmm1.
VEX.NDS.128.66.0F.WIG F4 /r | VPMULUDQ xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Multiply packed unsigned doubleword integers in xmm2 by packed unsigned doubleword integers in xmm3/m128, and store the quadword results in xmm1.
VEX.NDS.256.66.0F.WIG F4 /r | VPMULUDQ ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Multiply packed unsigned doubleword integers in ymm2 by packed unsigned doubleword integers in ymm3/m256, and store the quadword results in ymm1.
EVEX.NDS.128.66.0F.W1 F4 /r | VPMULUDQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Multiply packed unsigned doubleword integers in xmm2 by packed unsigned doubleword integers in xmm3/m128/m64bcst, and store the quadword results in xmm1 under writemask k1.
EVEX.NDS.256.66.0F.W1 F4 /r | VPMULUDQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Multiply packed unsigned doubleword integers in ymm2 by packed unsigned doubleword integers in ymm3/m256/m64bcst, and store the quadword results in ymm1 under writemask k1.
EVEX.NDS.512.66.0F.W1 F4 /r | VPMULUDQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | FV | V/V | AVX512F | Multiply packed unsigned doubleword integers in zmm2 by packed unsigned doubleword integers in zmm3/m512/m64bcst, and store the quadword results in zmm1 under writemask k1.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers"
in the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
--- | --- | --- | --- | ---
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Multiplies the first operand (destination operand) by the second operand (source operand) and stores the result in 
the destination operand. 

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15). 

Legacy SSE version 64-bit operand: The source operand can be an unsigned doubleword integer stored in the low
doubleword of an MMX technology register or a 64-bit memory location. The destination operand can be an
unsigned doubleword integer stored in the low doubleword of an MMX technology register. The result is an unsigned
quadword integer stored in the destination MMX technology register. When a quadword result is too large to be
represented in 64 bits (overflow), the result is wrapped around and the low 64 bits are written to the destination
element (that is, the carry is ignored).

For 64-bit memory operands, 64 bits are fetched from memory, but only the low doubleword is used in the
computation.

128-bit Legacy SSE version: The second source operand is two packed unsigned doubleword integers stored in the
first (low) and third doublewords of an XMM register or a 128-bit memory location. For 128-bit memory operands,
128 bits are fetched from memory, but only the first and third doublewords are used in the computation. The first
source operand is two packed unsigned doubleword integers stored in the first and third doublewords of an XMM
register. The destination contains two packed unsigned quadword integers stored in an XMM register. Bits
(VLMAX-1:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: The second source operand is two packed unsigned doubleword integers stored in the
first (low) and third doublewords of an XMM register or a 128-bit memory location. For 128-bit memory operands,
128 bits are fetched from memory, but only the first and third doublewords are used in the computation. The first
source operand is two packed unsigned doubleword integers stored in the first and third doublewords of an XMM
register. The destination contains two packed unsigned quadword integers stored in an XMM register. Bits
(VLMAX-1:128) of the destination YMM register are zeroed.

VEX.256 encoded version: The second source operand is four packed unsigned doubleword integers stored in the
first (low), third, fifth and seventh doublewords of a YMM register or a 256-bit memory location. For 256-bit
memory operands, 256 bits are fetched from memory, but only the first, third, fifth and seventh doublewords are
used in the computation. The first source operand is four packed unsigned doubleword integers stored in the first,
third, fifth and seventh doublewords of a YMM register. The destination contains four packed unsigned quadword
integers stored in a YMM register.

EVEX encoded version: The input unsigned doubleword integers are taken from the even-numbered elements of
the source operands. The first source operand is a ZMM/YMM/XMM register. The second source operand can be a
ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
64-bit memory location. The destination is a ZMM/YMM/XMM register, and is updated according to the writemask at
64-bit granularity.
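
The element selection described above (only the even-numbered doublewords take part) can be sketched in scalar C for the 128-bit case. The function name `pmuludq_model` and the array layout are assumptions for illustration only. As a side observation, the product of two unsigned 32-bit values always fits in 64 bits, so this model never needs to truncate.

```c
#include <stdint.h>

/* Hypothetical scalar model of the 128-bit form: each operand is viewed
   as four doublewords; only doublewords 0 and 2 are multiplied. */
static void pmuludq_model(const uint32_t a[4], const uint32_t b[4],
                          uint64_t out[2])
{
    out[0] = (uint64_t)a[0] * (uint64_t)b[0];   /* DEST[63:0]   */
    out[1] = (uint64_t)a[2] * (uint64_t)b[2];   /* DEST[127:64] */
}
```

Doublewords 1 and 3 of both inputs are read but ignored, which matches the statement that the full memory operand is fetched even though only part of it is used.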

Operation 

PMULUDQ (with 64-Bit operands)

DEST[63:0] <- DEST[31:0] * SRC[31:0];

PMULUDQ (with 128-Bit operands)

DEST[63:0] <- DEST[31:0] * SRC[31:0];
DEST[127:64] <- DEST[95:64] * SRC[95:64];

VPMULUDQ (VEX.128 encoded version)

DEST[63:0] <- SRC1[31:0] * SRC2[31:0]
DEST[127:64] <- SRC1[95:64] * SRC2[95:64]
DEST[VLMAX-1:128] <- 0

VPMULUDQ (VEX.256 encoded version)

DEST[63:0] <- SRC1[31:0] * SRC2[31:0]
DEST[127:64] <- SRC1[95:64] * SRC2[95:64]
DEST[191:128] <- SRC1[159:128] * SRC2[159:128]
DEST[255:192] <- SRC1[223:192] * SRC2[223:192]
DEST[VLMAX-1:256] <- 0

VPMULUDQ (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j <- 0 TO KL-1
    i <- j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*)
            THEN DEST[i+63:i] <- ZeroExtend64( SRC1[i+31:i] ) * ZeroExtend64( SRC2[31:0] )
            ELSE DEST[i+63:i] <- ZeroExtend64( SRC1[i+31:i] ) * ZeroExtend64( SRC2[i+31:i] )
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+63:i] <- 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0
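
The broadcast branch of the EVEX pseudocode above can be sketched as follows. This is a scalar illustration that leaves the writemask out for brevity; the function name `vpmuludq_bcst_model` and the `broadcast` flag (standing in for EVEX.b = 1 with a memory source) are hypothetical.

```c
#include <stdint.h>

/* Hypothetical scalar model over kl quadword lanes, without masking.
   broadcast != 0 models EVEX.b = 1 with a memory source: the low
   doubleword of the first element is reused for every lane. */
static void vpmuludq_bcst_model(uint64_t dest[], const uint64_t src1[],
                                const uint64_t src2[], int kl, int broadcast)
{
    for (int j = 0; j < kl; j++) {
        /* ZeroExtend64(SRC1[i+31:i]): low doubleword of lane j. */
        uint64_t lhs = src1[j] & 0xFFFFFFFFu;
        /* SRC2[31:0] when broadcasting, otherwise SRC2[i+31:i]. */
        uint64_t rhs = (broadcast ? src2[0] : src2[j]) & 0xFFFFFFFFu;
        dest[j] = lhs * rhs;
    }
}
```

The upper doubleword of every source lane is masked away before the multiply, so stray bits there do not affect the result.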

Intel C/C++ Compiler Intrinsic Equivalent

VPMULUDQ __m512i _mm512_mul_epu32(__m512i a, __m512i b);
VPMULUDQ __m512i _mm512_mask_mul_epu32(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPMULUDQ __m512i _mm512_maskz_mul_epu32(__mmask8 k, __m512i a, __m512i b);
VPMULUDQ __m256i _mm256_mask_mul_epu32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPMULUDQ __m256i _mm256_maskz_mul_epu32(__mmask8 k, __m256i a, __m256i b);
VPMULUDQ __m128i _mm_mask_mul_epu32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMULUDQ __m128i _mm_maskz_mul_epu32(__mmask8 k, __m128i a, __m128i b);
PMULUDQ: __m64 _mm_mul_su32(__m64 a, __m64 b)
(V)PMULUDQ: __m128i _mm_mul_epu32(__m128i a, __m128i b)
VPMULUDQ: __m256i _mm256_mul_epu32(__m256i a, __m256i b);

Flags Affected 

None. 

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded instruction, see Exceptions Type E4.

POP—Pop a Value from the Stack


Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
--- | --- | --- | --- | --- | ---
8F /0 | POP r/m16 | M | Valid | Valid | Pop top of stack into m16; increment stack pointer.
8F /0 | POP r/m32 | M | N.E. | Valid | Pop top of stack into m32; increment stack pointer.
8F /0 | POP r/m64 | M | Valid | N.E. | Pop top of stack into m64; increment stack pointer. Cannot encode 32-bit operand size.
58+ rw | POP r16 | O | Valid | Valid | Pop top of stack into r16; increment stack pointer.
58+ rd | POP r32 | O | N.E. | Valid | Pop top of stack into r32; increment stack pointer.
58+ rd | POP r64 | O | Valid | N.E. | Pop top of stack into r64; increment stack pointer. Cannot encode 32-bit operand size.
1F | POP DS | NP | Invalid | Valid | Pop top of stack into DS; increment stack pointer.
07 | POP ES | NP | Invalid | Valid | Pop top of stack into ES; increment stack pointer.
17 | POP SS | NP | Invalid | Valid | Pop top of stack into SS; increment stack pointer.
0F A1 | POP FS | NP | Valid | Valid | Pop top of stack into FS; increment stack pointer by 16 bits.
0F A1 | POP FS | NP | N.E. | Valid | Pop top of stack into FS; increment stack pointer by 32 bits.
0F A1 | POP FS | NP | Valid | N.E. | Pop top of stack into FS; increment stack pointer by 64 bits.
0F A9 | POP GS | NP | Valid | Valid | Pop top of stack into GS; increment stack pointer by 16 bits.
0F A9 | POP GS | NP | N.E. | Valid | Pop top of stack into GS; increment stack pointer by 32 bits.
0F A9 | POP GS | NP | Valid | N.E. | Pop top of stack into GS; increment stack pointer by 64 bits.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
--- | --- | --- | --- | ---
M | ModRM:r/m (w) | NA | NA | NA
O | opcode + rd (w) | NA | NA | NA
NP | NA | NA | NA | NA


Description 

Loads the value from the top of the stack to the location specified with the destination operand (or explicit opcode)
and then increments the stack pointer. The destination operand can be a general-purpose register, memory
location, or segment register.

Address and operand sizes are determined and used as follows: 

• Address size. The D flag in the current code-segment descriptor determines the default address size; it may be
overridden by an instruction prefix (67H).

The address size is used only when writing to a destination operand in memory.

• Operand size. The D flag in the current code-segment descriptor determines the default operand size; it may 
be overridden by instruction prefixes (66H or REX.W). 

The operand size (16, 32, or 64 bits) determines the amount by which the stack pointer is incremented (2, 4 
or 8). 

• Stack-address size. Outside of 64-bit mode, the B flag in the current stack-segment descriptor determines the 
size of the stack pointer (16 or 32 bits); in 64-bit mode, the size of the stack pointer is always 64 bits. 

The stack-address size determines the width of the stack pointer when reading from the stack in memory and 
when incrementing the stack pointer. (As stated above, the amount by which the stack pointer is incremented 
is determined by the operand size.) 
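
The interplay of the three sizes above can be sketched in C: the operand size fixes the increment, while the stack-address size fixes how many bits of the stack pointer are updated. The helper name `sp_after_pop` and its parameters are assumptions made for this illustration.

```c
#include <stdint.h>

/* Hypothetical helper: stack-pointer value after a POP.
   operand_bits is 16, 32, or 64; stack_addr_bits is 16, 32, or 64. */
static uint64_t sp_after_pop(uint64_t sp, int operand_bits, int stack_addr_bits)
{
    uint64_t increment = (uint64_t)operand_bits / 8;   /* 2, 4, or 8 */
    uint64_t mask = (stack_addr_bits == 64)
                        ? UINT64_MAX
                        : ((UINT64_C(1) << stack_addr_bits) - 1);
    /* Only the low stack_addr_bits of the pointer are incremented;
       with a 16-bit stack, SP wraps and the upper bits stay as they were. */
    return (sp & ~mask) | ((sp + increment) & mask);
}
```

For example, a 32-bit operand popped on a 16-bit stack at 0x1234FFFE would, under this model, leave 0x12340002 in the register.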

If the destination operand is one of the segment registers DS, ES, FS, GS, or SS, the value loaded into the register
must be a valid segment selector. In protected mode, popping a segment selector into a segment register
automatically causes the descriptor information associated with that segment selector to be loaded into the hidden
(shadow) part of the segment register and causes the selector and the descriptor information to be validated (see
the "Operation" section below).

A NULL value (0000-0003) may be popped into the DS, ES, FS, or GS register without causing a general protection 
fault. However, any subsequent attempt to reference a segment whose corresponding segment register is loaded 
with a NULL value causes a general protection exception (#GP). In this situation, no memory reference occurs and 
the saved value of the segment register is NULL. 
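
The NULL range 0000-0003 follows from the selector layout: bits 1:0 hold the RPL, bit 2 is the table indicator, and bits 15:3 are the index. A selector is NULL when the index and table indicator are both zero, whatever the RPL. A minimal C check, with the hypothetical name `selector_is_null`, might look like this:

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 for the NULL selectors 0000H-0003H. */
static int selector_is_null(uint16_t selector)
{
    /* Mask off the RPL (bits 1:0); index and TI must both be zero. */
    return (selector & 0xFFFCu) == 0;
}
```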

The POP instruction cannot pop a value into the CS register. To load the CS register from the stack, use the RET 
instruction. 

If the ESP register is used as a base register for addressing a destination operand in memory, the POP instruction
computes the effective address of the operand after it increments the ESP register. For the case of a 16-bit stack
where ESP wraps to 0H as a result of the POP instruction, the resulting location of the memory write is
processor-family-specific.

The POP ESP instruction increments the stack pointer (ESP) before data at the old top of stack is written into the 
destination. 
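
The ordering rules in the two paragraphs above can be sketched with a toy machine in C. Everything here is hypothetical scaffolding for illustration: the `toy_cpu` type, its 64-byte flat memory, and the helper names are not part of any real interface, and no bounds checking is done.

```c
#include <stdint.h>
#include <string.h>

/* Toy flat memory (addresses 0..63) plus a 32-bit stack pointer. */
typedef struct {
    uint8_t  mem[64];
    uint32_t esp;
} toy_cpu;

static uint32_t load32(const toy_cpu *c, uint32_t addr)
{
    uint32_t v;
    memcpy(&v, &c->mem[addr], sizeof v);
    return v;
}

static void store32(toy_cpu *c, uint32_t addr, uint32_t v)
{
    memcpy(&c->mem[addr], &v, sizeof v);
}

/* Models POP dword [ESP + disp]. */
static void pop_to_esp_relative(toy_cpu *c, uint32_t disp)
{
    uint32_t value = load32(c, c->esp);  /* read the old top of stack */
    c->esp += 4;                         /* increment ESP first ...   */
    store32(c, c->esp + disp, value);    /* ... then form the address */
}

/* Models POP ESP. */
static void pop_esp(toy_cpu *c)
{
    uint32_t value = load32(c, c->esp);  /* read the old top of stack    */
    c->esp += 4;                         /* increment precedes the write */
    c->esp = value;                      /* popped value ends up in ESP  */
}
```

In this model, `pop_to_esp_relative` with a displacement of 4 and ESP = 8 writes to address 16 rather than 12, because the address is formed from the already incremented pointer.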

A POP SS instruction inhibits all interrupts, including the NMI interrupt, until after execution of the next instruction. 
This action allows sequential execution of POP SS and MOV ESP, EBP instructions without the danger of having an 
invalid stack during an interrupt¹. However, use of the LSS instruction is the preferred method of loading the SS
and ESP registers. 

In 64-bit mode, using a REX prefix in the form of REX.R permits access to additional registers (R8-R15). When in 
64-bit mode, POPs using 32-bit operands are not encodable and POPs to DS, ES, SS are not valid. See the summary 
chart at the beginning of this section for encoding data and limits. 

Operation 

IF StackAddrSize = 32
    THEN
        IF OperandSize = 32
            THEN
                DEST <- SS:ESP; (* Copy a doubleword *)
                ESP <- ESP + 4;
            ELSE (* OperandSize = 16 *)
                DEST <- SS:ESP; (* Copy a word *)
                ESP <- ESP + 2;
        FI;
    ELSE IF StackAddrSize = 64
        THEN
            IF OperandSize = 64
                THEN
                    DEST <- SS:RSP; (* Copy quadword *)
                    RSP <- RSP + 8;
                ELSE (* OperandSize = 16 *)
                    DEST <- SS:RSP; (* Copy a word *)
                    RSP <- RSP + 2;
            FI;
        FI;
    ELSE StackAddrSize = 16
        THEN
            IF OperandSize = 16
                THEN
                    DEST <- SS:SP; (* Copy a word *)
                    SP <- SP + 2;
                ELSE (* OperandSize = 32 *)
                    DEST <- SS:SP; (* Copy a doubleword *)
                    SP <- SP + 4;
            FI;
FI;

1. If a code instruction breakpoint (for debug) is placed on an instruction located immediately after a POP SS instruction, the breakpoint
may not be triggered. However, in a sequence of instructions that POP the SS register, only the first instruction in the sequence is
guaranteed to delay an interrupt.

In the following sequence, interrupts may be recognized before POP ESP executes:

POP SS
POP SS
POP ESP


Loading a segment register while in protected mode results in special actions, as described in the following listing. 
These checks are performed on the segment selector and the segment descriptor it points to. 

64-BIT_MODE
IF FS, or GS is loaded with non-NULL selector;
    THEN
        IF segment selector index is outside descriptor table limits
            OR segment is not a data or readable code segment
            OR ((segment is a data or nonconforming code segment)
                AND (both RPL and CPL > DPL))
                THEN #GP(selector);
        IF segment not marked present
            THEN #NP(selector);
        ELSE
            SegmentRegister <- segment selector;
            SegmentRegister <- segment descriptor;
        FI;
FI;
IF FS, or GS is loaded with a NULL selector;
    THEN
        SegmentRegister <- segment selector;
        SegmentRegister <- segment descriptor;


PROTECTED MODE OR COMPATIBILITY MODE;
IF SS is loaded;
    THEN
        IF segment selector is NULL
            THEN #GP(0);
        FI;
        IF segment selector index is outside descriptor table limits
            or segment selector's RPL ≠ CPL
            or segment is not a writable data segment
            or DPL ≠ CPL
                THEN #GP(selector);
        FI;
        IF segment not marked present
            THEN #SS(selector);
        ELSE
            SS <- segment selector;
            SS <- segment descriptor;
        FI;
FI;
IF DS, ES, FS, or GS is loaded with non-NULL selector;
    THEN
        IF segment selector index is outside descriptor table limits
            or segment is not a data or readable code segment
            or ((segment is a data or nonconforming code segment)
                and (both RPL and CPL > DPL))
                THEN #GP(selector);
        FI;
        IF segment not marked present
            THEN #NP(selector);
        ELSE
            SegmentRegister <- segment selector;
            SegmentRegister <- segment descriptor;
        FI;
FI;
IF DS, ES, FS, or GS is loaded with a NULL selector
    THEN
        SegmentRegister <- segment selector;
        SegmentRegister <- segment descriptor;
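
The SS branch of the listing above can be sketched as a C decision function. This is a simplified illustration only: the `ss_desc` structure reduces a descriptor to the few properties the checks consult, and the type, enumerator, and function names are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    SS_LOAD_OK,
    SS_FAULT_GP0,      /* #GP(0)        */
    SS_FAULT_GP_SEL,   /* #GP(selector) */
    SS_FAULT_SS_SEL    /* #SS(selector) */
} ss_load_result;

/* Hypothetical, reduced view of the descriptor the selector points to. */
typedef struct {
    bool     in_table_limits;  /* selector index within descriptor table */
    bool     writable_data;    /* segment is a writable data segment     */
    bool     present;          /* segment marked present                 */
    unsigned dpl;              /* descriptor privilege level             */
} ss_desc;

static ss_load_result check_ss_load(uint16_t selector, const ss_desc *d,
                                    unsigned cpl)
{
    unsigned rpl = selector & 3u;             /* RPL is bits 1:0 */

    if ((selector & 0xFFFCu) == 0)            /* NULL selector */
        return SS_FAULT_GP0;
    if (!d->in_table_limits || rpl != cpl ||
        !d->writable_data || d->dpl != cpl)
        return SS_FAULT_GP_SEL;
    if (!d->present)
        return SS_FAULT_SS_SEL;
    return SS_LOAD_OK;                        /* selector and descriptor load */
}
```

The checks run in the same order as the listing, so a selector that fails several conditions reports the first fault the listing would raise.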


Flags Affected 

None. 


Protected Mode Exceptions 


#GP(0) If attempt is made to load SS register with NULL segment selector.

If the destination operand is in a non-writable segment.

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register is used to access memory and it contains a NULL segment
selector.

#GP(selector) If segment selector index is outside descriptor table limits.

If the SS register is being loaded and the segment selector's RPL and the segment descriptor's
DPL are not equal to the CPL.

If the SS register is being loaded and the segment pointed to is a
non-writable data segment.

If the DS, ES, FS, or GS register is being loaded and the segment pointed to is not a data or
readable code segment.

If the DS, ES, FS, or GS register is being loaded and the segment pointed to is a data or
nonconforming code segment, but both the RPL and the CPL are greater than the DPL.

#SS(0) If the current top of stack is not within the stack segment.

If a memory operand effective address is outside the SS segment limit.

#SS(selector) If the SS register is being loaded and the segment pointed to is marked not present.

#NP If the DS, ES, FS, or GS register is being loaded and the segment pointed to is marked not
present.

#PF(fault-code) If a page fault occurs.

#AC(0) If an unaligned memory reference is made while the current privilege level is 3 and alignment
checking is enabled.

#UD If the LOCK prefix is used.


Real-Address Mode Exceptions 

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#UD If the LOCK prefix is used. 

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If an unaligned memory reference is made while alignment checking is enabled. 

#UD If the LOCK prefix is used. 


Compatibility Mode Exceptions 

Same as for protected mode exceptions. 


64-Bit Mode Exceptions 


#GP(0) If the memory address is in a non-canonical form.

#SS(0) If the stack address is in a non-canonical form.

#GP(selector) If the descriptor is outside the descriptor table limit.

If the FS or GS register is being loaded and the segment pointed to is not a data or readable
code segment.

If the FS or GS register is being loaded and the segment pointed to is a data or nonconforming
code segment, but both the RPL and the CPL are greater than the DPL.

#AC(0) If an unaligned memory reference is made while alignment checking is enabled.

#PF(fault-code) If a page fault occurs.

#NP If the FS or GS register is being loaded and the segment pointed to is marked not present.

#UD If the LOCK prefix is used.

POPA/POPAD—Pop All General-Purpose Registers

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
--- | --- | --- | --- | --- | ---
61 | POPA | NP | Invalid | Valid | Pop DI, SI, BP, BX, DX, CX, and AX.
61 | POPAD | NP | Invalid | Valid | Pop EDI, ESI, EBP, EBX, EDX, ECX, and EAX.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
--- | --- | --- | --- | ---
NP | NA | NA | NA | NA


Description 

Pops doublewords (POPAD) or words (POPA) from the stack into the general-purpose registers. The registers are 
loaded in the following order: EDI, ESI, EBP, EBX, EDX, ECX, and EAX (if the operand-size attribute is 32) and DI, 
SI, BP, BX, DX, CX, and AX (if the operand-size attribute is 16). (These instructions reverse the operation of the 
PUSHA/PUSHAD instructions.) The value on the stack for the ESP or SP register is ignored. Instead, the ESP or SP 
register is incremented after each register is loaded. 

The POPA (pop all) and POPAD (pop all double) mnemonics reference the same opcode. The POPA instruction is
intended for use when the operand-size attribute is 16 and the POPAD instruction for when the operand-size
attribute is 32. Some assemblers may force the operand size to 16 when POPA is used and to 32 when POPAD is used
(using the operand-size override prefix [66H] if necessary). Others may treat these mnemonics as synonyms
(POPA/POPAD) and use the current setting of the operand-size attribute to determine the size of values to be
popped from the stack, regardless of the mnemonic used. (The D flag in the current code segment's segment
descriptor determines the operand-size attribute.)

This instruction executes as described in non-64-bit modes. It is not valid in 64-bit mode. 
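
The load order described above, including the skipped stack-pointer slot, can be sketched in C for the 32-bit case. The `regs32` structure, the function name `popad_model`, and the convention of passing the stack as an array of doublewords starting at address 0 are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical register file for the 32-bit operand-size case. */
typedef struct {
    uint32_t edi, esi, ebp, ebx, edx, ecx, eax, esp;
} regs32;

/* stack_dwords[n] models the doubleword at byte address 4 * n. */
static void popad_model(regs32 *r, const uint32_t stack_dwords[])
{
    const uint32_t *top = &stack_dwords[r->esp / 4];

    r->edi = top[0];
    r->esi = top[1];
    r->ebp = top[2];
    /* top[3] is the saved ESP image; it is skipped, not loaded. */
    r->ebx = top[4];
    r->edx = top[5];
    r->ecx = top[6];
    r->eax = top[7];
    r->esp += 32;   /* eight doubleword slots consumed in total */
}
```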

Operation 

IF 64-Bit Mode
    THEN
        #UD;
    ELSE
        IF OperandSize = 32 (* Instruction = POPAD *)
            THEN
                EDI <- Pop();
                ESI <- Pop();
                EBP <- Pop();
                Increment ESP by 4; (* Skip next 4 bytes of stack *)
                EBX <- Pop();
                EDX <- Pop();
                ECX <- Pop();
                EAX <- Pop();
            ELSE (* OperandSize = 16, instruction = POPA *)
                DI <- Pop();
                SI <- Pop();
                BP <- Pop();
                Increment ESP by 2; (* Skip next 2 bytes of stack *)
                BX <- Pop();
                DX <- Pop();
                CX <- Pop();
                AX <- Pop();
        FI;
FI;

Flags Affected

None. 

Protected Mode Exceptions 

#SS(0) If the starting or ending stack address is not within the stack segment. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If an unaligned memory reference is made while the current privilege level is 3 and alignment
checking is enabled.

#UD If the LOCK prefix is used. 

Real-Address Mode Exceptions 

#SS If the starting or ending stack address is not within the stack segment. 

#UD If the LOCK prefix is used. 

Virtual-8086 Mode Exceptions

#SS(0) If the starting or ending stack address is not within the stack segment. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If an unaligned memory reference is made while alignment checking is enabled. 

#UD If the LOCK prefix is used. 

Compatibility Mode Exceptions 

Same as for protected mode exceptions. 

64-Bit Mode Exceptions 

#UD If in 64-bit mode.

POPCNT — Return the Count of Number of Bits Set to 1


Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
--- | --- | --- | --- | --- | ---
F3 0F B8 /r | POPCNT r16, r/m16 | RM | Valid | Valid | POPCNT on r/m16
F3 0F B8 /r | POPCNT r32, r/m32 | RM | Valid | Valid | POPCNT on r/m32
F3 REX.W 0F B8 /r | POPCNT r64, r/m64 | RM | Valid | N.E. | POPCNT on r/m64


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
--- | --- | --- | --- | ---
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA


Description 

This instruction calculates the number of bits set to 1 in the second operand (source) and returns the count in the 
first operand (a destination register). 

Operation 

Count = 0;
For (i=0; i < OperandSize; i++)
{   IF (SRC[ i] = 1) // i'th bit
        THEN Count++; FI;
}
DEST <- Count;

Flags Affected 

OF, SF, AF, CF, and PF are all cleared. ZF is set if SRC = 0; otherwise ZF is cleared.
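
The counting loop and the ZF rule can be sketched together in portable C. The `popcnt_result` type and the function name `popcnt_model` are hypothetical; the model simply walks the low `operand_size` bits of the source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result pair: the count written to DEST and the ZF value. */
typedef struct {
    unsigned count;
    bool     zf;
} popcnt_result;

/* operand_size is 16, 32, or 64; bits above it are ignored. */
static popcnt_result popcnt_model(uint64_t src, int operand_size)
{
    popcnt_result r = {0, false};

    for (int i = 0; i < operand_size; i++) {
        if ((src >> i) & 1u)    /* i'th bit of SRC */
            r.count++;
    }
    r.zf = (r.count == 0);      /* ZF is set only when SRC = 0 */
    return r;
}
```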

Intel C/C++ Compiler Intrinsic Equivalent 

POPCNT: int _mm_popcnt_u32(unsigned int a);

POPCNT: int64_t _mm_popcnt_u64(unsigned __int64 a);

Protected Mode Exceptions 

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS or GS segments. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF (fault-code) For a page fault. 

#AC(0) If an unaligned memory reference is made while the current privilege level is 3 and alignment
checking is enabled.

#UD If CPUID.01H:ECX.POPCNT [Bit 23] = 0. 

If LOCK prefix is used. 

Real-Address Mode Exceptions 

#GP(0) If any part of the operand lies outside of the effective address space from 0 to 0FFFFH. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#UD If CPUID.01H:ECX.POPCNT [Bit 23] = 0. 

If LOCK prefix is used. 
Virtual-8086 Mode Exceptions 

#GP(0) If any part of the operand lies outside of the effective address space from 0 to 0FFFFH. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF (fault-code) For a page fault. 

#AC(0) If an unaligned memory reference is made while alignment checking is enabled. 

#UD If CPUID.01H:ECX.POPCNT [Bit 23] = 0. 

If LOCK prefix is used. 


Compatibility Mode Exceptions 

Same exceptions as in Protected Mode. 

64-Bit Mode Exceptions 

#GP(0) If the memory address is in a non-canonical form. 

#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 

#PF (fault-code) For a page fault. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3. 

#UD If CPUID.01H:ECX.POPCNT [Bit 23] = 0. 

If LOCK prefix is used. 
POPF/POPFD/POPFQ - Pop Stack into EFLAGS Register 


Opcode  Instruction  Op/En  64-Bit Mode  Compat/Leg Mode  Description
9D      POPF         NP     Valid        Valid            Pop top of stack into lower 16 bits of EFLAGS.
9D      POPFD        NP     N.E.         Valid            Pop top of stack into EFLAGS.
9D      POPFQ        NP     Valid        N.E.             Pop top of stack and zero-extend into RFLAGS.


Instruction Operand Encoding

Op/En  Operand 1  Operand 2  Operand 3  Operand 4
NP     NA         NA         NA         NA


Description 

Pops a doubleword (POPFD) from the top of the stack (if the current operand-size attribute is 32) and stores the 
value in the EFLAGS register, or pops a word from the top of the stack (if the operand-size attribute is 16) and 
stores it in the lower 16 bits of the EFLAGS register (that is, the FLAGS register). These instructions reverse the 
operation of the PUSHF/PUSHFD instructions. 

The POPF (pop flags) and POPFD (pop flags double) mnemonics reference the same opcode. The POPF instruction 
is intended for use when the operand-size attribute is 16; the POPFD instruction is intended for use when the 
operand-size attribute is 32. Some assemblers may force the operand size to 16 for POPF and to 32 for POPFD. 
Others may treat the mnemonics as synonyms (POPF/POPFD) and use the setting of the operand-size attribute to 
determine the size of values to pop from the stack. 

The effect of POPF/POPFD on the EFLAGS register changes, depending on the mode of operation. See the Table 
4-15 and key below for details. 

When operating in protected, compatibility, or 64-bit mode at privilege level 0 (or in real-address mode, the equivalent 
to privilege level 0), all non-reserved flags in the EFLAGS register except RF¹, VIP, VIF, and VM may be modified. 
VIP, VIF and VM remain unaffected. 

When operating in protected, compatibility, or 64-bit mode with a privilege level greater than 0, but less than or 
equal to IOPL, all flags can be modified except the IOPL field and RF¹, IF, VIP, VIF, and VM; these remain unaffected. 
The AC and ID flags can only be modified if the operand-size attribute is 32. The interrupt flag (IF) is altered only 
when executing at a level at least as privileged as the IOPL. If a POPF/POPFD instruction is executed with insufficient 
privilege, an exception does not occur but privileged bits do not change. 

When operating in virtual-8086 mode (EFLAGS.VM = 1) without the virtual-8086 mode extensions (CR4.VME = 0), 
the POPF/POPFD instructions can be used only if IOPL = 3; otherwise, a general-protection exception (#GP) occurs. 
If the virtual-8086 mode extensions are enabled (CR4.VME = 1), POPF (but not POPFD) can be executed in 
virtual-8086 mode with IOPL < 3. 

In 64-bit mode, the mnemonic assigned is POPFQ (note that the 32-bit operand is not encodable). POPFQ pops 64 
bits from the stack. Reserved bits of RFLAGS (including the upper 32 bits of RFLAGS) are not affected. 

See Chapter 3 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1, for more information 
about the EFLAGS registers. 


1. RF is always zero after the execution of POPF. This is because POPF, like all instructions, clears RF as it begins to execute. 
Table 4-15. Effect of POPF/POPFD on the EFLAGS Register 

Operand            Flags
Size    CPL  IOPL  21    20    19    18    17    16    14    13:12 11    10    9     8     7     6     4     2     0     Notes
                   ID    VIP   VIF   AC    VM    RF    NT    IOPL  OF    DF    IF    TF    SF    ZF    AF    PF    CF

Real-Address Mode (CR0.PE = 0)
16      0    0-3   N     N     N     N     N     0     S     S     S     S     S     S     S     S     S     S     S
32      0    0-3   S     N     N     S     N     0     S     S     S     S     S     S     S     S     S     S     S

Protected, Compatibility, and 64-Bit Modes (CR0.PE = 1, EFLAGS.VM = 0)
16      0    0-3   N     N     N     N     N     0     S     S     S     S     S     S     S     S     S     S     S
16      1-3  <CPL  N     N     N     N     N     0     S     N     S     S     N     S     S     S     S     S     S
16      1-3  ≥CPL  N     N     N     N     N     0     S     N     S     S     S     S     S     S     S     S     S
32, 64  0    0-3   S     N     N     S     N     0     S     S     S     S     S     S     S     S     S     S     S
32, 64  1-3  <CPL  S     N     N     S     N     0     S     N     S     S     N     S     S     S     S     S     S
32, 64  1-3  ≥CPL  S     N     N     S     N     0     S     N     S     S     S     S     S     S     S     S     S

Virtual-8086 (CR0.PE = 1, EFLAGS.VM = 1, CR4.VME = 0)
16      3    0-2   X     X     X     X     X     X     X     X     X     X     X     X     X     X     X     X     X     1
16      3    3     N     N     N     N     N     0     S     N     S     S     S     S     S     S     S     S     S
32      3    0-2   X     X     X     X     X     X     X     X     X     X     X     X     X     X     X     X     X     1
32      3    3     S     N     N     S     N     0     S     N     S     S     S     S     S     S     S     S     S

VME (CR0.PE = 1, EFLAGS.VM = 1, CR4.VME = 1)
16      3    0-2   N/X   N/X   SV/X  N/X   N/X   0/X   S/X   N/X   S/X   S/X   N/X   S/X   S/X   S/X   S/X   S/X   S/X   2
16      3    3     N     N     N     N     N     0     S     N     S     S     S     S     S     S     S     S     S
32      3    0-2   X     X     X     X     X     X     X     X     X     X     X     X     X     X     X     X     X     1
32      3    3     S     N     N     S     N     0     S     N     S     S     S     S     S     S     S     S     S

NOTES: 

1. #GP fault - no flag update 

2. #GP fault with no flag update if VIP=1 in EFLAGS register and IF=1 in FLAGS value on stack 

Key 

S     Updated from stack
SV    Updated from IF (bit 9) in FLAGS value on stack
N     No change in value
X     No EFLAGS update
0     Value is cleared


Operation 

IF VM = 0 (* Not in Virtual-8086 Mode *)
    THEN IF CPL = 0
        THEN
            IF OperandSize = 32;
                THEN
                    EFLAGS <- Pop(); (* 32-bit pop *)
                    (* All non-reserved flags except RF, VIP, VIF, and VM can be modified;
                    VIP, VIF, VM, and all reserved bits are unaffected. RF is cleared. *)
                ELSE IF (OperandSize = 64)
                    RFLAGS <- Pop(); (* 64-bit pop *)
                    (* All non-reserved flags except RF, VIP, VIF, and VM can be modified;
                    VIP, VIF, VM, and all reserved bits are unaffected. RF is cleared. *)
                ELSE (* OperandSize = 16 *)
                    EFLAGS[15:0] <- Pop(); (* 16-bit pop *)
                    (* All non-reserved flags can be modified. *)
            FI;
        ELSE (* CPL > 0 *)
            IF OperandSize = 32
                THEN
                    IF CPL > IOPL
                        THEN
                            EFLAGS <- Pop(); (* 32-bit pop *)
                            (* All non-reserved bits except IF, IOPL, VIP, VIF, VM and RF can be modified;
                            IF, IOPL, VIP, VIF, VM and all reserved bits are unaffected; RF is cleared. *)
                        ELSE
                            EFLAGS <- Pop(); (* 32-bit pop *)
                            (* All non-reserved bits except IOPL, VIP, VIF, VM and RF can be modified;
                            IOPL, VIP, VIF, VM and all reserved bits are unaffected; RF is cleared. *)
                    FI;
                ELSE IF (OperandSize = 64)
                    IF CPL > IOPL
                        THEN
                            RFLAGS <- Pop(); (* 64-bit pop *)
                            (* All non-reserved bits except IF, IOPL, VIP, VIF, VM and RF can be modified;
                            IF, IOPL, VIP, VIF, VM and all reserved bits are unaffected; RF is cleared. *)
                        ELSE
                            RFLAGS <- Pop(); (* 64-bit pop *)
                            (* All non-reserved bits except IOPL, VIP, VIF, VM and RF can be modified;
                            IOPL, VIP, VIF, VM and all reserved bits are unaffected; RF is cleared. *)
                    FI;
                ELSE (* OperandSize = 16 *)
                    EFLAGS[15:0] <- Pop(); (* 16-bit pop *)
                    (* All non-reserved bits except IOPL can be modified; IOPL and all
                    reserved bits are unaffected. *)
            FI;
    FI;
    ELSE IF CR4.VME = 1 (* In Virtual-8086 Mode with VME Enabled *)
        IF IOPL = 3
            THEN IF OperandSize = 32
                THEN
                    EFLAGS <- Pop();
                    (* All non-reserved bits except IOPL, VIP, VIF, VM, and RF can be modified;
                    VIP, VIF, VM, IOPL and all reserved bits are unaffected. RF is cleared. *)
                ELSE
                    EFLAGS[15:0] <- Pop(); FI;
                    (* All non-reserved bits except IOPL can be modified;
                    IOPL and all reserved bits are unaffected. *)
            FI;
            ELSE (* IOPL < 3 *)
                IF (OperandSize = 32)
                    THEN
                        #GP(0); (* Trap to virtual-8086 monitor. *)
                    ELSE (* OperandSize = 16 *)
                        tempFLAGS <- Pop();
                        IF EFLAGS.VIP = 1 AND tempFLAGS[9] = 1
                            THEN #GP(0);
                            ELSE
                                EFLAGS.VIF <- tempFLAGS[9];
                                EFLAGS[15:0] <- tempFLAGS;
                                (* All non-reserved bits except IOPL and IF can be modified;
                                IOPL, IF, and all reserved bits are unaffected. *)
                        FI;
                FI;
        FI;
    ELSE (* In Virtual-8086 Mode *)
        IF IOPL = 3
            THEN IF OperandSize = 32
                THEN
                    EFLAGS <- Pop();
                    (* All non-reserved bits except IOPL, VIP, VIF, VM, and RF can be modified;
                    VIP, VIF, VM, IOPL and all reserved bits are unaffected. RF is cleared. *)
                ELSE
                    EFLAGS[15:0] <- Pop(); FI;
                    (* All non-reserved bits except IOPL can be modified;
                    IOPL and all reserved bits are unaffected. *)
            ELSE (* IOPL < 3 *)
                #GP(0); (* Trap to virtual-8086 monitor. *)
        FI;
    FI;
FI;
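
The non-virtual-8086 branch of the pseudocode above (VM = 0) amounts to choosing which EFLAGS bits may be written and then merging the popped value under that mask. The following C sketch models only that branch, under stated assumptions: popf_model and the FL_* macros are hypothetical names, the stack access and all exceptions are omitted, and RF is cleared for every operand size as Table 4-15 indicates.

```c
#include <stdint.h>

/* EFLAGS bit masks (bit positions as listed in Table 4-15). */
#define FL_CF   (1u << 0)
#define FL_PF   (1u << 2)
#define FL_AF   (1u << 4)
#define FL_ZF   (1u << 6)
#define FL_SF   (1u << 7)
#define FL_TF   (1u << 8)
#define FL_IF   (1u << 9)
#define FL_DF   (1u << 10)
#define FL_OF   (1u << 11)
#define FL_IOPL (3u << 12)
#define FL_NT   (1u << 14)
#define FL_RF   (1u << 16)
#define FL_AC   (1u << 18)
#define FL_ID   (1u << 21)

/* Model of POPF/POPFD/POPFQ with EFLAGS.VM = 0.
   popped is the value taken from the stack; operand_size is 16, 32, or 64. */
uint32_t popf_model(uint32_t old_eflags, uint32_t popped,
                    unsigned operand_size, unsigned cpl)
{
    unsigned iopl = (old_eflags & FL_IOPL) >> 12;

    /* Flags that are updated from the stack in every row of this mode. */
    uint32_t writable = FL_CF | FL_PF | FL_AF | FL_ZF | FL_SF |
                        FL_TF | FL_DF | FL_OF | FL_NT;

    if (cpl == 0) {
        writable |= FL_IOPL;          /* IOPL changes only at CPL 0 */
    }
    if (cpl <= iopl) {
        writable |= FL_IF;            /* IF needs CPL <= IOPL */
    }
    if (operand_size >= 32) {
        writable |= FL_AC | FL_ID;    /* upper flags need a 32/64-bit pop */
    }

    /* VIP, VIF, VM, and reserved bits are never in the mask, so they are kept. */
    uint32_t result = (old_eflags & ~writable) | (popped & writable);
    return result & ~FL_RF;           /* RF is cleared */
}
```

For example, at CPL 3 with IOPL 0, popping a value with IF clear leaves IF unchanged and raises no exception, which matches the "insufficient privilege" behavior in the Description.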

Flags Affected 

All flags may be affected; see the Operation section for details. 

Protected Mode Exceptions 

#SS(0) If the top of stack is not within the stack segment. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If an unaligned memory reference is made while the current privilege level is 3 and alignment checking is enabled. 

#UD If the LOCK prefix is used. 

Real-Address Mode Exceptions 

#SS If the top of stack is not within the stack segment. 

#UD If the LOCK prefix is used. 

Virtual-8086 Mode Exceptions 

#GP(0) If the I/O privilege level is less than 3. 

If an attempt is made to execute the POPF/POPFD instruction with an operand-size override 
prefix. 

#SS(0) If the top of stack is not within the stack segment. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If an unaligned memory reference is made while alignment checking is enabled. 

#UD If the LOCK prefix is used. 

Compatibility Mode Exceptions 

Same as for protected mode exceptions. 
64-Bit Mode Exceptions 

#GP(0) If the memory address is in a non-canonical form. 

#SS(0) If the stack address is in a non-canonical form. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3. 

#UD If the LOCK prefix is used. 
POR - Bitwise Logical OR 


Opcode/                                       Op/    64/32 bit  CPUID     Description
Instruction                                   En     Mode       Feature
                                                     Support    Flag

0F EB /r¹                                     RM     V/V        MMX       Bitwise OR of mm/m64 and mm.
POR mm, mm/m64

66 0F EB /r                                   RM     V/V        SSE2      Bitwise OR of xmm2/m128 and xmm1.
POR xmm1, xmm2/m128

VEX.NDS.128.66.0F.WIG EB /r                   RVM    V/V        AVX       Bitwise OR of xmm2/m128 and xmm3.
VPOR xmm1, xmm2, xmm3/m128

VEX.NDS.256.66.0F.WIG EB /r                   RVM    V/V        AVX2      Bitwise OR of ymm2/m256 and ymm3.
VPOR ymm1, ymm2, ymm3/m256

EVEX.NDS.128.66.0F.W0 EB /r                   FV     V/V        AVX512VL  Bitwise OR of packed doubleword integers in
VPORD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst                     AVX512F   xmm2 and xmm3/m128/m32bcst using writemask k1.

EVEX.NDS.256.66.0F.W0 EB /r                   FV     V/V        AVX512VL  Bitwise OR of packed doubleword integers in
VPORD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst                     AVX512F   ymm2 and ymm3/m256/m32bcst using writemask k1.

EVEX.NDS.512.66.0F.W0 EB /r                   FV     V/V        AVX512F   Bitwise OR of packed doubleword integers in
VPORD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst                               zmm2 and zmm3/m512/m32bcst using writemask k1.

EVEX.NDS.128.66.0F.W1 EB /r                   FV     V/V        AVX512VL  Bitwise OR of packed quadword integers in
VPORQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst                     AVX512F   xmm2 and xmm3/m128/m64bcst using writemask k1.

EVEX.NDS.256.66.0F.W1 EB /r                   FV     V/V        AVX512VL  Bitwise OR of packed quadword integers in
VPORQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst                     AVX512F   ymm2 and ymm3/m256/m64bcst using writemask k1.

EVEX.NDS.512.66.0F.W1 EB /r                   FV     V/V        AVX512F   Bitwise OR of packed quadword integers in
VPORQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst                               zmm2 and zmm3/m512/m64bcst using writemask k1.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers" 
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. 


Instruction Operand Encoding 

Op/En  Operand 1         Operand 2      Operand 3      Operand 4
RM     ModRM:reg (r, w)  ModRM:r/m (r)  NA             NA
RVM    ModRM:reg (w)     VEX.vvvv (r)   ModRM:r/m (r)  NA
FV     ModRM:reg (w)     EVEX.vvvv (r)  ModRM:r/m (r)  NA


Description 

Performs a bitwise logical OR operation on the source operand (second operand) and the destination operand (first 
operand) and stores the result in the destination operand. Each bit of the result is set to 1 if either or both of the 
corresponding bits of the first and second operands are 1; otherwise, it is set to 0. 

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15). 
Legacy SSE version: The source operand can be an MMX technology register or a 64-bit memory location. The 
destination operand is an MMX technology register. 

128-bit Legacy SSE version: The second source operand is an XMM register or a 128-bit memory location. The first 
source and destination operands can be XMM registers. Bits (VLMAX-1:128) of the corresponding YMM destination 
register remain unchanged. 

VEX.128 encoded version: The second source operand is an XMM register or a 128-bit memory location. The first 
source and destination operands can be XMM registers. Bits (VLMAX-1:128) of the destination YMM register are 
zeroed. 

VEX.256 encoded version: The second source operand is a YMM register or a 256-bit memory location. The first 
source and destination operands can be YMM registers. 

EVEX encoded version: The first source operand is a ZMM/YMM/XMM register. The second source operand can be a 
ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 
32/64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with 
writemask k1 at 32/64-bit granularity. 

Operation 

POR (64-bit operand) 

DEST <- DEST OR SRC 

POR (128-bit Legacy SSE version) 

DEST <- DEST OR SRC 
DEST[VLMAX-1:128] (Unmodified) 

VPOR (VEX.128 encoded version) 

DEST <- SRC1 OR SRC2 
DEST[VLMAX-1:128] <- 0 

VPOR (VEX.256 encoded version) 

DEST <- SRC1 OR SRC2 
DEST[VLMAX-1:256] <- 0 

VPORD (EVEX encoded versions) 

(KL, VL) = (4, 128), (8, 256), (16, 512) 
FOR j <- 0 TO KL-1 
    i <- j * 32 
    IF k1[j] OR *no writemask* THEN 
        IF (EVEX.b = 1) AND (SRC2 *is memory*) 
            THEN DEST[i+31:i] <- SRC1[i+31:i] BITWISE OR SRC2[31:0] 
            ELSE DEST[i+31:i] <- SRC1[i+31:i] BITWISE OR SRC2[i+31:i] 
        FI; 
    ELSE 
        IF *merging-masking* ; merging-masking 
            *DEST[i+31:i] remains unchanged* 
            ELSE ; zeroing-masking 
                DEST[i+31:i] <- 0 
        FI; 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] <- 0 
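
The VPORD loop above can be sketched in portable C as follows. This is an illustrative model, not an implementation of the instruction: vpord_model and its parameters are hypothetical names, each array element stands for one doubleword lane, and zeroing of the bits above VL is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Model of VPORD on kl doubleword lanes (kl = 4, 8, or 16).
   mask plays the role of k1; pass 0xFFFF to model the unmasked form.
   zeroing selects zeroing-masking, otherwise merging-masking applies.
   broadcast models EVEX.b = 1 with a memory source (element 0 is reused). */
void vpord_model(uint32_t *dest, const uint32_t *src1, const uint32_t *src2,
                 size_t kl, uint16_t mask, bool zeroing, bool broadcast)
{
    for (size_t j = 0; j < kl; j++) {
        if ((mask >> j) & 1u) {
            uint32_t b = broadcast ? src2[0] : src2[j];
            dest[j] = src1[j] | b;
        } else if (zeroing) {
            dest[j] = 0;   /* zeroing-masking */
        }                  /* merging-masking: dest[j] keeps its old value */
    }
}
```

With mask 0x5 and merging-masking, only lanes 0 and 2 are written and lanes 1 and 3 keep their previous contents.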
Intel C/C++ Compiler Intrinsic Equivalent 

VPORD __m512i _mm512_or_epi32(__m512i a, __m512i b); 
VPORD __m512i _mm512_mask_or_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b); 
VPORD __m512i _mm512_maskz_or_epi32(__mmask16 k, __m512i a, __m512i b); 
VPORD __m256i _mm256_or_epi32(__m256i a, __m256i b); 
VPORD __m256i _mm256_mask_or_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b); 
VPORD __m256i _mm256_maskz_or_epi32(__mmask8 k, __m256i a, __m256i b); 
VPORD __m128i _mm_or_epi32(__m128i a, __m128i b); 
VPORD __m128i _mm_mask_or_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b); 
VPORD __m128i _mm_maskz_or_epi32(__mmask8 k, __m128i a, __m128i b); 
VPORQ __m512i _mm512_or_epi64(__m512i a, __m512i b); 
VPORQ __m512i _mm512_mask_or_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b); 
VPORQ __m512i _mm512_maskz_or_epi64(__mmask8 k, __m512i a, __m512i b); 
VPORQ __m256i _mm256_or_epi64(__m256i a, __m256i b); 
VPORQ __m256i _mm256_mask_or_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b); 
VPORQ __m256i _mm256_maskz_or_epi64(__mmask8 k, __m256i a, __m256i b); 
VPORQ __m128i _mm_or_epi64(__m128i a, __m128i b); 
VPORQ __m128i _mm_mask_or_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b); 
VPORQ __m128i _mm_maskz_or_epi64(__mmask8 k, __m128i a, __m128i b); 
POR: __m64 _mm_or_si64(__m64 m1, __m64 m2) 
(V)POR: __m128i _mm_or_si128(__m128i m1, __m128i m2) 
VPOR: __m256i _mm256_or_si256(__m256i a, __m256i b) 


Flags Affected 

None. 


SIMD Floating-Point Exceptions 

None. 


Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 
EVEX-encoded instruction, see Exceptions Type E4. 
PREFETCHh - Prefetch Data Into Caches 


Opcode    Instruction     Op/En  64-Bit Mode  Compat/Leg Mode  Description
0F 18 /1  PREFETCHT0 m8   M      Valid        Valid            Move data from m8 closer to the processor using T0 hint.
0F 18 /2  PREFETCHT1 m8   M      Valid        Valid            Move data from m8 closer to the processor using T1 hint.
0F 18 /3  PREFETCHT2 m8   M      Valid        Valid            Move data from m8 closer to the processor using T2 hint.
0F 18 /0  PREFETCHNTA m8  M      Valid        Valid            Move data from m8 closer to the processor using NTA hint.


Instruction Operand Encoding 

Op/En  Operand 1      Operand 2  Operand 3  Operand 4
M      ModRM:r/m (r)  NA         NA         NA


Description 

Fetches the line of data from memory that contains the byte specified with the source operand to a location in the 
cache hierarchy specified by a locality hint: 

• T0 (temporal data): prefetch data into all levels of the cache hierarchy. 

• T1 (temporal data with respect to first level cache misses): prefetch data into level 2 cache and higher. 

• T2 (temporal data with respect to second level cache misses): prefetch data into level 3 cache and higher, or 
an implementation-specific choice. 

• NTA (non-temporal data with respect to all cache levels): prefetch data into non-temporal cache structure and 
into a location close to the processor, minimizing cache pollution. 

The source operand is a byte memory location. (The locality hints are encoded into the machine level instruction 
using bits 3 through 5 of the ModR/M byte.) 

If the line selected is already present in the cache hierarchy at a level closer to the processor, no data movement 
occurs. Prefetches from uncacheable or WC memory are ignored. 

The PREFETCHh instruction is merely a hint and does not affect program behavior. If executed, this instruction 
moves data closer to the processor in anticipation of future use. 

The implementation of prefetch locality hints is implementation-dependent, and can be overloaded or ignored by a 
processor implementation. The amount of data prefetched is also processor implementation-dependent. It will, 
however, be a minimum of 32 bytes. Additional details of the implementation-dependent locality hints are 
described in Section 7.4 of Intel® 64 and IA-32 Architectures Optimization Reference Manual. 

It should be noted that processors are free to speculatively fetch and cache data from system memory regions that 
are assigned a memory-type that permits speculative reads (that is, the WB, WC, and WT memory types). A 
PREFETCHh instruction is considered a hint to this speculative behavior. Because this speculative fetching can occur 
at any time and is not tied to instruction execution, a PREFETCHh instruction is not ordered with respect to the 
fence instructions (MFENCE, SFENCE, and LFENCE) or locked memory references. A PREFETCHh instruction is also 
unordered with respect to CLFLUSH and CLFLUSHOPT instructions, other PREFETCHh instructions, or any other 
general instruction. It is ordered with respect to serializing instructions such as CPUID, WRMSR, OUT, and MOV CR. 

This instruction's operation is the same in non-64-bit modes and 64-bit mode. 

Operation 

FETCH (m8); 
Intel C/C++ Compiler Intrinsic Equivalent 

void _mm_prefetch(char *p, int i) 

The argument "*p" gives the address of the byte (and corresponding cache line) to be prefetched. The value "i" 
gives a constant (_MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, or _MM_HINT_NTA) that specifies the type of 
prefetch operation to be performed. 
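
A minimal usage sketch of the intrinsic follows, assuming a simple streaming loop. The function name sum_with_prefetch, the PREFETCH_T0 wrapper macro, and the PREFETCH_DISTANCE value are hypothetical; a suitable distance depends on the processor and workload. Because the instruction is only a hint, the computed result is the same with or without the prefetch.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>
#define PREFETCH_T0(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define PREFETCH_T0(p) ((void)(p)) /* no-op where the intrinsic is unavailable */
#endif

/* Hypothetical tuning value: how many elements ahead to prefetch. */
#define PREFETCH_DISTANCE 16

/* Sums n elements while hinting that data a few elements ahead will be needed. */
uint64_t sum_with_prefetch(const uint32_t *data, size_t n)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            PREFETCH_T0(&data[i + PREFETCH_DISTANCE]);
        }
        sum += data[i];
    }
    return sum;
}
```

The bounds check keeps the prefetched address inside the array; it is a choice made for this sketch.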

Numeric Exceptions 

None. 

Exceptions (All Operating Modes) 

#UD If the LOCK prefix is used. 
PREFETCHW - Prefetch Data into Caches in Anticipation of a Write 


Opcode/       Op/  64/32 bit  CPUID    Description
Instruction   En   Mode       Feature
                   Support    Flag

0F 0D /1      M    V/V        PRFCHW   Move data from m8 closer to the processor in anticipation of a
PREFETCHW m8                           write.


Instruction Operand Encoding 

Op/En  Operand 1      Operand 2  Operand 3  Operand 4
M      ModRM:r/m (r)  NA         NA         NA


Description 

Fetches the cache line of data from memory that contains the byte specified with the source operand to a location 
in the 1st or 2nd level cache and invalidates other cached instances of the line. 

The source operand is a byte memory location. If the line selected is already present in the lowest level cache and 
is already in an exclusively owned state, no data movement occurs. Prefetches from non-writeback memory are 
ignored. 

The PREFETCHW instruction is merely a hint and does not affect program behavior. If executed, this instruction 
moves data closer to the processor and invalidates other cached copies in anticipation of the line being written to 
in the future. 

The characteristic of prefetch locality hints is implementation-dependent, and can be overloaded or ignored by a 
processor implementation. The amount of data prefetched is also processor implementation-dependent. It will, 
however, be a minimum of 32 bytes. Additional details of the implementation-dependent locality hints are 
described in Section 7.4 of Intel® 64 and IA-32 Architectures Optimization Reference Manual. 

It should be noted that processors are free to speculatively fetch and cache data with exclusive ownership from 
system memory regions that permit such accesses (that is, the WB memory type). A PREFETCHW instruction is 
considered a hint to this speculative behavior. Because this speculative fetching can occur at any time and is not 
tied to instruction execution, a PREFETCHW instruction is not ordered with respect to the fence instructions 
(MFENCE, SFENCE, and LFENCE) or locked memory references. A PREFETCHW instruction is also unordered with 
respect to CLFLUSH and CLFLUSHOPT instructions, other PREFETCHW instructions, or any other general instruction. 
It is ordered with respect to serializing instructions such as CPUID, WRMSR, OUT, and MOV CR. 

This instruction's operation is the same in non-64-bit modes and 64-bit mode. 

Operation 

FETCH_WITH_EXCLUSIVE_OWNERSHIP(m8); 

Flags Affected 

None. 

C/C++ Compiler Intrinsic Equivalent 

void _m_prefetchw( void *); 

Protected Mode Exceptions 

#UD If the LOCK prefix is used. 

Real-Address Mode Exceptions 

#UD If the LOCK prefix is used. 
Virtual-8086 Mode Exceptions 

#UD If the LOCK prefix is used. 

Compatibility Mode Exceptions 

#UD If the LOCK prefix is used. 

64-Bit Mode Exceptions 

#UD If the LOCK prefix is used. 
PREFETCHWT1 - Prefetch Vector Data Into Caches with Intent to Write and T1 Hint 


Opcode/         Op/  64/32 bit  CPUID        Description
Instruction     En   Mode       Feature
                     Support    Flag

0F 0D /2        M    V/V        PREFETCHWT1  Move data from m8 closer to the processor using T1 hint
PREFETCHWT1 m8                               with intent to write.


Instruction Operand Encoding 

Op/En  Operand 1      Operand 2  Operand 3  Operand 4
M      ModRM:r/m (r)  NA         NA         NA


Description 

Fetches the line of data from memory that contains the byte specified with the source operand to a location in the 
cache hierarchy specified by an intent to write hint (so that data is brought into 'Exclusive' state via a request for 
ownership) and a locality hint: 

• T1 (temporal data with respect to first level cache)—prefetch data into the second level cache. 

The source operand is a byte memory location. (The locality hints are encoded into the machine level instruction 
using bits 3 through 5 of the ModR/M byte. Use of any ModR/M value other than the specified ones will lead to 
unpredictable behavior.) 

If the line selected is already present in the cache hierarchy at a level closer to the processor, no data movement 
occurs. Prefetches from uncacheable or WC memory are ignored. 

The PREFETCHh instruction is merely a hint and does not affect program behavior. If executed, this instruction 
moves data closer to the processor in anticipation of future use. 

The implementation of prefetch locality hints is implementation-dependent, and can be overloaded or ignored by a 
processor implementation. The amount of data prefetched is also processor implementation-dependent. It will, 
however, be a minimum of 32 bytes. 

It should be noted that processors are free to speculatively fetch and cache data from system memory regions that 
are assigned a memory-type that permits speculative reads (that is, the WB, WC, and WT memory types). A 
PREFETCHh instruction is considered a hint to this speculative behavior. Because this speculative fetching can occur 
at any time and is not tied to instruction execution, a PREFETCHh instruction is not ordered with respect to the 
fence instructions (MFENCE, SFENCE, and LFENCE) or locked memory references. A PREFETCHh instruction is also 
unordered with respect to CLFLUSH and CLFLUSHOPT instructions, other PREFETCHh instructions, or any other 
general instruction. It is ordered with respect to serializing instructions such as CPUID, WRMSR, OUT, and MOV CR. 

This instruction's operation is the same in non-64-bit modes and 64-bit mode. 

Operation 

PREFETCH(mem, Level, State) Prefetches a byte memory location pointed by 'mem' into the cache level specified by 'Level'; a request 
for exclusive/ownership is done if 'State' is 1. Note that the memory location ignores cache line splits. This operation is considered a 
hint for the processor and may be skipped depending on implementation. 

Prefetch (m8, Level = 1, EXCLUSIVE=1); 

Flags Affected 

None. 

C/C++ Compiler Intrinsic Equivalent 

void _mm_prefetch( char const *, int hint= _MM_HINT_ET1); 

Protected Mode Exceptions 

#UD If the LOCK prefix is used. 




Real-Address Mode Exceptions 

#UD If the LOCK prefix is used. 

Virtual-8086 Mode Exceptions 

#UD If the LOCK prefix is used. 

Compatibility Mode Exceptions 

#UD If the LOCK prefix is used. 

64-Bit Mode Exceptions 

#UD If the LOCK prefix is used. 
PSADBW - Compute Sum of Absolute Differences 


Opcode/                         Op/    64/32 bit  CPUID     Description
Instruction                     En     Mode       Feature
                                       Support    Flag

0F F6 /r¹                       RM     V/V        SSE       Computes the absolute differences of the packed unsigned
PSADBW mm1, mm2/m64                                         byte integers from mm2/m64 and mm1; differences are then
                                                            summed to produce an unsigned word integer result.

66 0F F6 /r                     RM     V/V        SSE2      Computes the absolute differences of the packed unsigned
PSADBW xmm1, xmm2/m128                                      byte integers from xmm2/m128 and xmm1; the 8 low
                                                            differences and 8 high differences are then summed
                                                            separately to produce two unsigned word integer results.

VEX.NDS.128.66.0F.WIG F6 /r     RVM    V/V        AVX       Computes the absolute differences of the packed unsigned
VPSADBW xmm1, xmm2, xmm3/m128                               byte integers from xmm3/m128 and xmm2; the 8 low
                                                            differences and 8 high differences are then summed
                                                            separately to produce two unsigned word integer results.

VEX.NDS.256.66.0F.WIG F6 /r     RVM    V/V        AVX2      Computes the absolute differences of the packed unsigned
VPSADBW ymm1, ymm2, ymm3/m256                               byte integers from ymm3/m256 and ymm2; then each
                                                            consecutive 8 differences are summed separately to
                                                            produce four unsigned word integer results.

EVEX.NDS.128.66.0F.WIG F6 /r    FVM    V/V        AVX512VL  Computes the absolute differences of the packed unsigned
VPSADBW xmm1, xmm2, xmm3/m128                     AVX512BW  byte integers from xmm3/m128 and xmm2; then each
                                                            consecutive 8 differences are summed separately to
                                                            produce two unsigned word integer results.

EVEX.NDS.256.66.0F.WIG F6 /r    FVM    V/V        AVX512VL  Computes the absolute differences of the packed unsigned
VPSADBW ymm1, ymm2, ymm3/m256                     AVX512BW  byte integers from ymm3/m256 and ymm2; then each
                                                            consecutive 8 differences are summed separately to
                                                            produce four unsigned word integer results.

EVEX.NDS.512.66.0F.WIG F6 /r    FVM    V/V        AVX512BW  Computes the absolute differences of the packed unsigned
VPSADBW zmm1, zmm2, zmm3/m512                               byte integers from zmm3/m512 and zmm2; then each
                                                            consecutive 8 differences are summed separately to
                                                            produce eight unsigned word integer results.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers" 
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. 


Instruction Operand Encoding 

Op/En  Operand 1         Operand 2      Operand 3      Operand 4
RM     ModRM:reg (r, w)  ModRM:r/m (r)  NA             NA
RVM    ModRM:reg (w)     VEX.vvvv (r)   ModRM:r/m (r)  NA
FVM    ModRM:reg (w)     EVEX.vvvv (r)  ModRM:r/m (r)  NA


Description 

Computes the absolute value of the difference of 8 unsigned byte integers from the source operand (second 
operand) and from the destination operand (first operand). These 8 differences are then summed to produce an 
unsigned word integer result that is stored in the destination operand. Figure 4-14 shows the operation of the 
PSADBW instruction when using 64-bit operands. 

When operating on 64-bit operands, the word integer result is stored in the low word of the destination operand, 
and the remaining bytes in the destination operand are cleared to all 0s. 

When operating on 128-bit operands, two packed results are computed. Here, the 8 low-order bytes of the source 
and destination operands are operated on to produce a word result that is stored in the low word of the destination 
operand, and the 8 high-order bytes are operated on to produce a word result that is stored in bits 64 through 79 
of the destination operand. The remaining bytes of the destination operand are cleared. 

For the 256-bit version, the third group of 8 differences is summed to produce an unsigned word in bits [143:128] of
the destination register and the fourth group of 8 differences is summed to produce an unsigned word in
bits [207:192] of the destination register. The remaining words of the destination are set to 0.

For the 512-bit version, the fifth group result is stored in bits [271:256] of the destination. The result from the sixth
group is stored in bits [335:320]. The results for the seventh and eighth groups are stored in bits
[399:384] and bits [463:448], respectively. The remaining bits in the destination are set to 0.
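
The grouping described above can be sketched as a scalar reference model. This is an illustrative sketch only, not the processor's implementation; the function name psadbw_model and its array-based interface are assumptions made for this example. Each group of 8 bytes yields one 64-bit result whose low word holds the sum and whose upper bits are zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar model of (V)PSADBW.
   n_bytes is the vector width in bytes (8, 16, 32, or 64).
   dst receives one 64-bit element per group of 8 source bytes. */
void psadbw_model(const uint8_t *a, const uint8_t *b,
                  uint64_t *dst, size_t n_bytes)
{
    assert(n_bytes % 8 == 0);
    for (size_t group = 0; group < n_bytes / 8; group++) {
        uint64_t sum = 0;
        for (size_t k = 0; k < 8; k++) {
            int diff = (int)a[group * 8 + k] - (int)b[group * 8 + k];
            sum += (uint64_t)abs(diff);
        }
        /* The sum is at most 8 * 255 = 2040, so it always fits in the
           low 16 bits; the remaining 48 bits of the element stay zero. */
        dst[group] = sum;
    }
}
```

Because the largest possible sum is 2040, the word result can never overflow, which is why the upper bits of each 64-bit element are simply cleared.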

In 64-bit mode and not encoded by VEX/EVEX prefix, using a REX prefix in the form of REX.R permits this instruction
to access additional registers (XMM8-XMM15).

Legacy SSE version: The source operand can be an MMX technology register or a 64-bit memory location. The
destination operand is an MMX technology register.

128-bit Legacy SSE version: The first source operand and destination register are XMM registers. The second
source operand is an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding ZMM
destination register remain unchanged.

VEX.128 and EVEX.128 encoded versions: The first source operand and destination register are XMM registers. The
second source operand is an XMM register or a 128-bit memory location. Bits (MAX_VL-1:128) of the corresponding
ZMM register are zeroed.

VEX.256 and EVEX.256 encoded versions: The first source operand and destination register are YMM registers. The
second source operand is a YMM register or a 256-bit memory location. Bits (MAX_VL-1:256) of the corresponding
ZMM register are zeroed.

EVEX.512 encoded version: The first source operand and destination register are ZMM registers. The second
source operand is a ZMM register or a 512-bit memory location.




SRC    X7  X6  X5  X4  X3  X2  X1  X0
DEST   Y7  Y6  Y5  Y4  Y3  Y2  Y1  Y0
TEMP   ABS(X7:Y7)  ABS(X6:Y6)  ABS(X5:Y5)  ABS(X4:Y4)  ABS(X3:Y3)  ABS(X2:Y2)  ABS(X1:Y1)  ABS(X0:Y0)
DEST   00H  00H  00H  00H  00H  00H  SUM(TEMP7...TEMP0)

Figure 4-14. PSADBW Instruction Operation Using 64-bit Operands


Operation 

VPSADBW (EVEX encoded versions)

VL = 128, 256, 512

TEMP0 <- ABS(SRC1[7:0] - SRC2[7:0])
(* Repeat operation for bytes 1 through 14 *)
TEMP15 <- ABS(SRC1[127:120] - SRC2[127:120])
DEST[15:0] <- SUM(TEMP0:TEMP7)
DEST[63:16] <- 000000000000H
DEST[79:64] <- SUM(TEMP8:TEMP15)
DEST[127:80] <- 000000000000H

IF VL >= 256
    (* Repeat operation for bytes 16 through 30 *)
    TEMP31 <- ABS(SRC1[255:248] - SRC2[255:248])
    DEST[143:128] <- SUM(TEMP16:TEMP23)
    DEST[191:144] <- 000000000000H
    DEST[207:192] <- SUM(TEMP24:TEMP31)
    DEST[255:208] <- 000000000000H
FI;

IF VL >= 512
    (* Repeat operation for bytes 32 through 62 *)
    TEMP63 <- ABS(SRC1[511:504] - SRC2[511:504])
    DEST[271:256] <- SUM(TEMP32:TEMP39)
    DEST[319:272] <- 000000000000H
    DEST[335:320] <- SUM(TEMP40:TEMP47)
    DEST[383:336] <- 000000000000H
    DEST[399:384] <- SUM(TEMP48:TEMP55)
    DEST[447:400] <- 000000000000H
    DEST[463:448] <- SUM(TEMP56:TEMP63)
    DEST[511:464] <- 000000000000H
FI;

DEST[MAX_VL-1:VL] <- 0


VPSADBW (VEX.256 encoded version)

TEMP0 <- ABS(SRC1[7:0] - SRC2[7:0])
(* Repeat operation for bytes 1 through 30 *)
TEMP31 <- ABS(SRC1[255:248] - SRC2[255:248])
DEST[15:0] <- SUM(TEMP0:TEMP7)
DEST[63:16] <- 000000000000H
DEST[79:64] <- SUM(TEMP8:TEMP15)
DEST[127:80] <- 000000000000H
DEST[143:128] <- SUM(TEMP16:TEMP23)
DEST[191:144] <- 000000000000H
DEST[207:192] <- SUM(TEMP24:TEMP31)
DEST[255:208] <- 000000000000H
DEST[MAX_VL-1:256] <- 0

VPSADBW (VEX.128 encoded version)

TEMP0 <- ABS(SRC1[7:0] - SRC2[7:0])
(* Repeat operation for bytes 1 through 14 *)
TEMP15 <- ABS(SRC1[127:120] - SRC2[127:120])
DEST[15:0] <- SUM(TEMP0:TEMP7)
DEST[63:16] <- 000000000000H
DEST[79:64] <- SUM(TEMP8:TEMP15)
DEST[127:80] <- 000000000000H
DEST[MAX_VL-1:128] <- 0


PSADBW (128-bit Legacy SSE version)

TEMP0 <- ABS(DEST[7:0] - SRC[7:0])
(* Repeat operation for bytes 1 through 14 *)
TEMP15 <- ABS(DEST[127:120] - SRC[127:120])
DEST[15:0] <- SUM(TEMP0:TEMP7)
DEST[63:16] <- 000000000000H
DEST[79:64] <- SUM(TEMP8:TEMP15)
DEST[127:80] <- 000000000000H
DEST[MAX_VL-1:128] (Unmodified)


PSADBW (64-bit operand)

TEMP0 <- ABS(DEST[7:0] - SRC[7:0])
(* Repeat operation for bytes 1 through 6 *)
TEMP7 <- ABS(DEST[63:56] - SRC[63:56])
DEST[15:0] <- SUM(TEMP0:TEMP7)
DEST[63:16] <- 000000000000H

Intel C/C++ Compiler Intrinsic Equivalent 

VPSADBW: __m512i _mm512_sad_epu8(__m512i a, __m512i b)

PSADBW: __m64 _mm_sad_pu8(__m64 a, __m64 b)

(V)PSADBW: __m128i _mm_sad_epu8(__m128i a, __m128i b)

VPSADBW: __m256i _mm256_sad_epu8(__m256i a, __m256i b)

Flags Affected 

None. 

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 
EVEX-encoded instruction, see Exceptions Type E4NF.nb. 
PSHUFB — Packed Shuffle Bytes

Opcode/                                         Op/   64/32 bit     CPUID         Description
Instruction                                     En    Mode Support  Feature Flag

0F 38 00 /r¹                                    RM    V/V           SSSE3         Shuffle bytes in mm1 according to contents of
PSHUFB mm1, mm2/m64                                                               mm2/m64.

66 0F 38 00 /r                                  RM    V/V           SSSE3         Shuffle bytes in xmm1 according to contents of
PSHUFB xmm1, xmm2/m128                                                            xmm2/m128.

VEX.NDS.128.66.0F38.WIG 00 /r                   RVM   V/V           AVX           Shuffle bytes in xmm2 according to contents of
VPSHUFB xmm1, xmm2, xmm3/m128                                                     xmm3/m128.

VEX.NDS.256.66.0F38.WIG 00 /r                   RVM   V/V           AVX2          Shuffle bytes in ymm2 according to contents of
VPSHUFB ymm1, ymm2, ymm3/m256                                                     ymm3/m256.

EVEX.NDS.128.66.0F38.WIG 00 /r                  FVM   V/V           AVX512VL      Shuffle bytes in xmm2 according to contents of
VPSHUFB xmm1 {k1}{z}, xmm2, xmm3/m128                               AVX512BW      xmm3/m128 under write mask k1.

EVEX.NDS.256.66.0F38.WIG 00 /r                  FVM   V/V           AVX512VL      Shuffle bytes in ymm2 according to contents of
VPSHUFB ymm1 {k1}{z}, ymm2, ymm3/m256                               AVX512BW      ymm3/m256 under write mask k1.

EVEX.NDS.512.66.0F38.WIG 00 /r                  FVM   V/V           AVX512BW      Shuffle bytes in zmm2 according to contents of
VPSHUFB zmm1 {k1}{z}, zmm2, zmm3/m512                                             zmm3/m512 under write mask k1.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers"
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding

Op/En   Operand 1          Operand 2       Operand 3       Operand 4
RM      ModRM:reg (r, w)   ModRM:r/m (r)   NA              NA
RVM     ModRM:reg (w)      VEX.vvvv (r)    ModRM:r/m (r)   NA
FVM     ModRM:reg (w)      EVEX.vvvv (r)   ModRM:r/m (r)   NA


Description 

PSHUFB performs in-place shuffles of bytes in the destination operand (the first operand) according to the shuffle
control mask in the source operand (the second operand). The instruction permutes the data in the destination
operand, leaving the shuffle mask unaffected. If the most significant bit (bit[7]) of a byte of the shuffle control
mask is set, then constant zero is written in the corresponding result byte. Otherwise, the byte in the shuffle
control mask forms an index that selects the byte of the destination operand to copy. The value of each index is
the least significant 4 bits (128-bit operation) or 3 bits (64-bit operation) of the shuffle control byte. When the
source operand is a 128-bit memory operand, the operand must be aligned on a 16-byte boundary or a
general-protection exception (#GP) will be generated.

In 64-bit mode and not encoded with VEX/EVEX, use the REX prefix to access XMM8-XMM15 registers.

Legacy SSE version 64-bit operand: Both operands can be MMX registers.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (VLMAX-1:128)
of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: The destination operand is the first operand, the first source operand is the second
operand, the second source operand is the third operand. Bits (VLMAX-1:128) of the destination YMM register are
zeroed.

VEX.256 encoded version: Bits (255:128) of the destination YMM register store the 16-byte shuffle result of the
upper 16 bytes of the first source operand, using the upper 16 bytes of the second source operand as the control mask.
The value of each index for the high 128-bit lane is the least significant 4 bits of the respective shuffle control
byte. The index value selects a source data element within each 128-bit lane.

EVEX encoded version: The second source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory
location. The first source operand and destination operands are ZMM/YMM/XMM registers. The destination is
conditionally updated with writemask k1.

EVEX and VEX encoded version: Four/two in-lane 128-bit shuffles.
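
The in-lane lookup described above can be sketched as a scalar reference model. This is a minimal illustration under stated assumptions: the function name pshufb_model and its array interface are hypothetical, writemasking is omitted, and dst must not alias src (the legacy in-place form would need a temporary copy of the data first).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of (V)PSHUFB for 16-, 32-, or 64-byte vectors.
   src holds the data bytes; ctrl holds the shuffle control mask. */
void pshufb_model(const uint8_t *src, const uint8_t *ctrl,
                  uint8_t *dst, size_t n_bytes)
{
    assert(n_bytes % 16 == 0);
    for (size_t j = 0; j < n_bytes; j++) {
        if (ctrl[j] & 0x80) {
            dst[j] = 0;                            /* bit 7 set: write zero */
        } else {
            size_t lane_base = j & ~(size_t)0xF;   /* first byte of this 128-bit lane */
            dst[j] = src[lane_base + (ctrl[j] & 0x0F)];
        }
    }
}
```

Only the low 4 bits of each control byte take part in the lookup, so a control byte can never select data from a different 128-bit lane.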

Operation 

PSHUFB (with 64 bit operands)

TEMP <- DEST
for i = 0 to 7 {
    if (SRC[(i * 8)+7] = 1) then
        DEST[(i*8)+7..(i*8)+0] <- 0;
    else
        index[2..0] <- SRC[(i*8)+2 .. (i*8)+0];
        DEST[(i*8)+7..(i*8)+0] <- TEMP[(index*8+7)..(index*8+0)];
    endif;
}

PSHUFB (with 128 bit operands)

TEMP <- DEST
for i = 0 to 15 {
    if (SRC[(i * 8)+7] = 1) then
        DEST[(i*8)+7..(i*8)+0] <- 0;
    else
        index[3..0] <- SRC[(i*8)+3 .. (i*8)+0];
        DEST[(i*8)+7..(i*8)+0] <- TEMP[(index*8+7)..(index*8+0)];
    endif
}

VPSHUFB (VEX.128 encoded version)

for i = 0 to 15 {
    if (SRC2[(i * 8)+7] = 1) then
        DEST[(i*8)+7..(i*8)+0] <- 0;
    else
        index[3..0] <- SRC2[(i*8)+3 .. (i*8)+0];
        DEST[(i*8)+7..(i*8)+0] <- SRC1[(index*8+7)..(index*8+0)];
    endif
}
DEST[VLMAX-1:128] <- 0

VPSHUFB (VEX.256 encoded version)

for i = 0 to 15 {
    if (SRC2[(i * 8)+7] == 1) then
        DEST[(i*8)+7..(i*8)+0] <- 0;
    else
        index[3..0] <- SRC2[(i*8)+3 .. (i*8)+0];
        DEST[(i*8)+7..(i*8)+0] <- SRC1[(index*8+7)..(index*8+0)];
    endif
    if (SRC2[128 + (i * 8)+7] == 1) then
        DEST[128 + (i*8)+7..128 + (i*8)+0] <- 0;
    else
        index[3..0] <- SRC2[128 + (i*8)+3 .. 128 + (i*8)+0];
        DEST[128 + (i*8)+7..128 + (i*8)+0] <- SRC1[128 + (index*8+7)..128 + (index*8+0)];
    endif
}

VPSHUFB (EVEX encoded versions)

(KL, VL) = (16, 128), (32, 256), (64, 512)
jmask <- (KL-1) & ~0xF    // 0x00, 0x10, 0x30 depending on the VL
FOR j = 0 TO KL-1         // dest
    IF k1[j] or no_masking
        index <- SRC2.byte[j];
        IF index & 0x80
            DEST.byte[j] <- 0;
        ELSE
            index <- (index & 0xF) + (j & jmask);    // 16-element in-lane lookup
            DEST.byte[j] <- SRC1.byte[index];
    ELSE if zeroing
        DEST.byte[j] <- 0;
DEST[MAX_VL-1:VL] <- 0;
Figure 4-15. PSHUFB with 64-Bit Operands


Intel C/C++ Compiler Intrinsic Equivalent 

VPSHUFB __m512i _mm512_shuffle_epi8(__m512i a, __m512i b);

VPSHUFB __m512i _mm512_mask_shuffle_epi8(__m512i s, __mmask64 k, __m512i a, __m512i b);

VPSHUFB __m512i _mm512_maskz_shuffle_epi8(__mmask64 k, __m512i a, __m512i b);

VPSHUFB __m256i _mm256_mask_shuffle_epi8(__m256i s, __mmask32 k, __m256i a, __m256i b);

VPSHUFB __m256i _mm256_maskz_shuffle_epi8(__mmask32 k, __m256i a, __m256i b);

VPSHUFB __m128i _mm_mask_shuffle_epi8(__m128i s, __mmask16 k, __m128i a, __m128i b);

VPSHUFB __m128i _mm_maskz_shuffle_epi8(__mmask16 k, __m128i a, __m128i b);

PSHUFB: __m64 _mm_shuffle_pi8(__m64 a, __m64 b)

(V)PSHUFB: __m128i _mm_shuffle_epi8(__m128i a, __m128i b)

VPSHUFB: __m256i _mm256_shuffle_epi8(__m256i a, __m256i b)

SIMD Floating-Point Exceptions 

None. 
Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4. 
EVEX-encoded instruction, see Exceptions Type E4NF.nb. 
PSHUFD—Shuffle Packed Doublewords

Opcode/                                         Op/   64/32 bit     CPUID         Description
Instruction                                     En    Mode Support  Feature Flag

66 0F 70 /r ib                                  RMI   V/V           SSE2          Shuffle the doublewords in xmm2/m128 based on
PSHUFD xmm1, xmm2/m128, imm8                                                      the encoding in imm8 and store the result in xmm1.

VEX.128.66.0F.WIG 70 /r ib                      RMI   V/V           AVX           Shuffle the doublewords in xmm2/m128 based on
VPSHUFD xmm1, xmm2/m128, imm8                                                     the encoding in imm8 and store the result in xmm1.

VEX.256.66.0F.WIG 70 /r ib                      RMI   V/V           AVX2          Shuffle the doublewords in ymm2/m256 based on
VPSHUFD ymm1, ymm2/m256, imm8                                                     the encoding in imm8 and store the result in ymm1.

EVEX.128.66.0F.W0 70 /r ib                      FV    V/V           AVX512VL      Shuffle the doublewords in xmm2/m128/m32bcst based on the
VPSHUFD xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8                       AVX512F       encoding in imm8 and store the result in xmm1 using writemask k1.

EVEX.256.66.0F.W0 70 /r ib                      FV    V/V           AVX512VL      Shuffle the doublewords in ymm2/m256/m32bcst based on the
VPSHUFD ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8                       AVX512F       encoding in imm8 and store the result in ymm1 using writemask k1.

EVEX.512.66.0F.W0 70 /r ib                      FV    V/V           AVX512F       Shuffle the doublewords in zmm2/m512/m32bcst based on the
VPSHUFD zmm1 {k1}{z}, zmm2/m512/m32bcst, imm8                                     encoding in imm8 and store the result in zmm1 using writemask k1.


Instruction Operand Encoding 


Op/En   Operand 1       Operand 2       Operand 3   Operand 4
RMI     ModRM:reg (w)   ModRM:r/m (r)   imm8        NA
FV      ModRM:reg (w)   ModRM:r/m (r)   imm8        NA


Description 

Copies doublewords from the source operand (second operand) and inserts them in the destination operand (first
operand) at the locations selected with the order operand (third operand). Figure 4-16 shows the operation of the
256-bit VPSHUFD instruction and the encoding of the order operand. Each 2-bit field in the order operand selects
the contents of one doubleword location within a 128-bit lane and copies it to the target element in the destination
operand. For example, bits 0 and 1 of the order operand target the first doubleword element in the low and high
128-bit lanes of the destination operand for 256-bit VPSHUFD. The encoded value of bits 1:0 of the order operand
(see the field encoding in Figure 4-16) determines which doubleword element (from the respective 128-bit lane) of
the source operand will be copied to doubleword 0 of the destination operand.

For 128-bit operation, only the low 128-bit lane is operative. The source operand can be an XMM register or a
128-bit memory location. The destination operand is an XMM register. The order operand is an 8-bit immediate.
Note that this instruction permits a doubleword in the source operand to be copied to more than one doubleword
location in the destination operand.
Figure 4-16. 256-bit VPSHUFD Instruction Operation


In 64-bit mode and not encoded in VEX/EVEX, using REX.R permits this instruction to access XMM8-XMM15. 

128-bit Legacy SSE version: Bits (VLMAX-1:128) of the corresponding YMM destination register remain
unchanged.

VEX.128 encoded version: The source operand can be an XMM register or a 128-bit memory location. The destination
operand is an XMM register. Bits (MAX_VL-1:128) of the corresponding ZMM register are zeroed.

VEX.256 encoded version: The source operand can be a YMM register or a 256-bit memory location. The destination
operand is a YMM register. Bits (MAX_VL-1:256) of the corresponding ZMM register are zeroed. Bits (255:128)
of the destination store the shuffled results of the upper 16 bytes of the source operand using the immediate
byte as the order operand.

EVEX encoded version: The source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location,
or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a
ZMM/YMM/XMM register updated according to the writemask.

Each 128-bit lane of the destination stores the shuffled results of the respective lane of the source operand using
the immediate byte as the order operand.
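
The per-lane selection can be illustrated with a scalar sketch. The function name pshufd_model and its array interface are assumptions made for this example, and writemasking and broadcast are left out. The same order byte is applied to every 128-bit lane, with each 2-bit field picking one of the four doublewords in that lane.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of (V)PSHUFD.
   n_dwords is 4, 8, or 16 (128-, 256-, or 512-bit vectors). */
void pshufd_model(const uint32_t *src, uint32_t *dst,
                  size_t n_dwords, uint8_t order)
{
    assert(n_dwords % 4 == 0);
    for (size_t j = 0; j < n_dwords; j++) {
        size_t lane_base = j & ~(size_t)3;             /* first dword of this lane */
        unsigned sel = (order >> (2 * (j & 3))) & 3u;  /* 2-bit field for element j */
        dst[j] = src[lane_base + sel];
    }
}
```

For instance, an order byte of 1BH (fields 3, 2, 1, 0 from low to high) reverses the doublewords inside each lane, while 00H copies doubleword 0 of each lane to all four positions of that lane.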

Note: EVEX.vvvv and VEX.vvvv are reserved and must be 1111b otherwise instructions will #UD. 

Operation 

PSHUFD (128-bit Legacy SSE version)

DEST[31:0] <- (SRC >> (ORDER[1:0] * 32))[31:0];
DEST[63:32] <- (SRC >> (ORDER[3:2] * 32))[31:0];
DEST[95:64] <- (SRC >> (ORDER[5:4] * 32))[31:0];
DEST[127:96] <- (SRC >> (ORDER[7:6] * 32))[31:0];
DEST[VLMAX-1:128] (Unmodified)

VPSHUFD (VEX.128 encoded version)

DEST[31:0] <- (SRC >> (ORDER[1:0] * 32))[31:0];
DEST[63:32] <- (SRC >> (ORDER[3:2] * 32))[31:0];
DEST[95:64] <- (SRC >> (ORDER[5:4] * 32))[31:0];
DEST[127:96] <- (SRC >> (ORDER[7:6] * 32))[31:0];
DEST[VLMAX-1:128] <- 0

VPSHUFD (VEX.256 encoded version)

DEST[31:0] <- (SRC[127:0] >> (ORDER[1:0] * 32))[31:0];
DEST[63:32] <- (SRC[127:0] >> (ORDER[3:2] * 32))[31:0];
DEST[95:64] <- (SRC[127:0] >> (ORDER[5:4] * 32))[31:0];
DEST[127:96] <- (SRC[127:0] >> (ORDER[7:6] * 32))[31:0];
DEST[159:128] <- (SRC[255:128] >> (ORDER[1:0] * 32))[31:0];
DEST[191:160] <- (SRC[255:128] >> (ORDER[3:2] * 32))[31:0];
DEST[223:192] <- (SRC[255:128] >> (ORDER[5:4] * 32))[31:0];
DEST[255:224] <- (SRC[255:128] >> (ORDER[7:6] * 32))[31:0];
DEST[VLMAX-1:256] <- 0


VPSHUFD (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j <- 0 TO KL-1
    i <- j * 32
    IF (EVEX.b = 1) AND (SRC *is memory*)
        THEN TMP_SRC[i+31:i] <- SRC[31:0]
        ELSE TMP_SRC[i+31:i] <- SRC[i+31:i]
    FI;
ENDFOR;

IF VL >= 128
    TMP_DEST[31:0] <- (TMP_SRC[127:0] >> (ORDER[1:0] * 32))[31:0];
    TMP_DEST[63:32] <- (TMP_SRC[127:0] >> (ORDER[3:2] * 32))[31:0];
    TMP_DEST[95:64] <- (TMP_SRC[127:0] >> (ORDER[5:4] * 32))[31:0];
    TMP_DEST[127:96] <- (TMP_SRC[127:0] >> (ORDER[7:6] * 32))[31:0];
FI;

IF VL >= 256
    TMP_DEST[159:128] <- (TMP_SRC[255:128] >> (ORDER[1:0] * 32))[31:0];
    TMP_DEST[191:160] <- (TMP_SRC[255:128] >> (ORDER[3:2] * 32))[31:0];
    TMP_DEST[223:192] <- (TMP_SRC[255:128] >> (ORDER[5:4] * 32))[31:0];
    TMP_DEST[255:224] <- (TMP_SRC[255:128] >> (ORDER[7:6] * 32))[31:0];
FI;

IF VL >= 512
    TMP_DEST[287:256] <- (TMP_SRC[383:256] >> (ORDER[1:0] * 32))[31:0];
    TMP_DEST[319:288] <- (TMP_SRC[383:256] >> (ORDER[3:2] * 32))[31:0];
    TMP_DEST[351:320] <- (TMP_SRC[383:256] >> (ORDER[5:4] * 32))[31:0];
    TMP_DEST[383:352] <- (TMP_SRC[383:256] >> (ORDER[7:6] * 32))[31:0];
    TMP_DEST[415:384] <- (TMP_SRC[511:384] >> (ORDER[1:0] * 32))[31:0];
    TMP_DEST[447:416] <- (TMP_SRC[511:384] >> (ORDER[3:2] * 32))[31:0];
    TMP_DEST[479:448] <- (TMP_SRC[511:384] >> (ORDER[5:4] * 32))[31:0];
    TMP_DEST[511:480] <- (TMP_SRC[511:384] >> (ORDER[7:6] * 32))[31:0];
FI;

FOR j <- 0 TO KL-1
    i <- j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] <- TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking*            ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking*      ; zeroing-masking
                    DEST[i+31:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0

Intel C/C++ Compiler Intrinsic Equivalent 

VPSHUFD __m512i _mm512_shuffle_epi32(__m512i a, int n);

VPSHUFD __m512i _mm512_mask_shuffle_epi32(__m512i s, __mmask16 k, __m512i a, int n);

VPSHUFD __m512i _mm512_maskz_shuffle_epi32(__mmask16 k, __m512i a, int n);

VPSHUFD __m256i _mm256_mask_shuffle_epi32(__m256i s, __mmask8 k, __m256i a, int n);

VPSHUFD __m256i _mm256_maskz_shuffle_epi32(__mmask8 k, __m256i a, int n);

VPSHUFD __m128i _mm_mask_shuffle_epi32(__m128i s, __mmask8 k, __m128i a, int n);

VPSHUFD __m128i _mm_maskz_shuffle_epi32(__mmask8 k, __m128i a, int n);

(V)PSHUFD: __m128i _mm_shuffle_epi32(__m128i a, int n)

VPSHUFD: __m256i _mm256_shuffle_epi32(__m256i a, const int n)

Flags Affected 

None. 

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded instruction, see Exceptions Type E4NF. 

#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
PSHUFHW-Shuffle Packed High Words

Opcode/                                         Op/   64/32 bit     CPUID         Description
Instruction                                     En    Mode Support  Feature Flag

F3 0F 70 /r ib                                  RMI   V/V           SSE2          Shuffle the high words in xmm2/m128 based on the
PSHUFHW xmm1, xmm2/m128, imm8                                                     encoding in imm8 and store the result in xmm1.

VEX.128.F3.0F.WIG 70 /r ib                      RMI   V/V           AVX           Shuffle the high words in xmm2/m128 based on the
VPSHUFHW xmm1, xmm2/m128, imm8                                                    encoding in imm8 and store the result in xmm1.

VEX.256.F3.0F.WIG 70 /r ib                      RMI   V/V           AVX2          Shuffle the high words in ymm2/m256 based on the
VPSHUFHW ymm1, ymm2/m256, imm8                                                    encoding in imm8 and store the result in ymm1.

EVEX.128.F3.0F.WIG 70 /r ib                     FVM   V/V           AVX512VL      Shuffle the high words in xmm2/m128 based on the encoding
VPSHUFHW xmm1 {k1}{z}, xmm2/m128, imm8                              AVX512BW      in imm8 and store the result in xmm1 under write mask k1.

EVEX.256.F3.0F.WIG 70 /r ib                     FVM   V/V           AVX512VL      Shuffle the high words in ymm2/m256 based on the encoding
VPSHUFHW ymm1 {k1}{z}, ymm2/m256, imm8                              AVX512BW      in imm8 and store the result in ymm1 under write mask k1.

EVEX.512.F3.0F.WIG 70 /r ib                     FVM   V/V           AVX512BW      Shuffle the high words in zmm2/m512 based on the encoding
VPSHUFHW zmm1 {k1}{z}, zmm2/m512, imm8                                            in imm8 and store the result in zmm1 under write mask k1.


Instruction Operand Encoding 


Op/En   Operand 1       Operand 2       Operand 3   Operand 4
RMI     ModRM:reg (w)   ModRM:r/m (r)   imm8        NA
FVM     ModRM:reg (w)   ModRM:r/m (r)   imm8        NA


Description 

Copies words from the high quadword of a 128-bit lane of the source operand and inserts them in the high quadword
of the destination operand at word locations (of the respective lane) selected with the immediate operand.
The 256-bit operation is similar to the in-lane operation used by the 256-bit VPSHUFD instruction, which is
illustrated in Figure 4-16. For 128-bit operation, only the low 128-bit lane is operative. Each 2-bit field in the immediate
operand selects the contents of one word location in the high quadword of the destination operand. The binary
encodings of the immediate operand fields select words (0, 1, 2 or 3) from the high quadword of the source
operand to be copied to the destination operand. The low quadword of the source operand is copied to the low
quadword of the destination operand, for each 128-bit lane.

Note that this instruction permits a word in the high quadword of the source operand to be copied to more than one
word location in the high quadword of the destination operand.
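
A scalar sketch may make the split between the two quadwords concrete. The function name pshufhw_model and its array interface are hypothetical, and writemasking is omitted. Within each 128-bit lane (8 words), words 0 through 3 pass through unchanged and words 4 through 7 are selected from the high quadword by the 2-bit immediate fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of (V)PSHUFHW.
   n_words is 8, 16, or 32 (128-, 256-, or 512-bit vectors). */
void pshufhw_model(const uint16_t *src, uint16_t *dst,
                   size_t n_words, uint8_t imm)
{
    assert(n_words % 8 == 0);
    for (size_t lane = 0; lane < n_words; lane += 8) {
        for (size_t k = 0; k < 4; k++) {
            unsigned sel = (imm >> (2 * k)) & 3u;      /* 2-bit field for word k */
            dst[lane + k] = src[lane + k];             /* low quadword: copied */
            dst[lane + 4 + k] = src[lane + 4 + sel];   /* high quadword: shuffled */
        }
    }
}
```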

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15). 

128-bit Legacy SSE version: The destination operand is an XMM register. The source operand can be an XMM
register or a 128-bit memory location. Bits (VLMAX-1:128) of the corresponding YMM destination register remain
unchanged.

VEX.128 encoded version: The destination operand is an XMM register. The source operand can be an XMM register
or a 128-bit memory location. Bits (VLMAX-1:128) of the destination YMM register are zeroed. VEX.vvvv is
reserved and must be 1111b, VEX.L must be 0, otherwise the instruction will #UD.

VEX.256 encoded version: The destination operand is a YMM register. The source operand can be a YMM register
or a 256-bit memory location.

EVEX encoded version: The destination operand is a ZMM/YMM/XMM register. The source operand can be a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination is updated according to the
writemask.
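
The writemask update applied by the EVEX forms can be sketched in scalar code. This is an illustration only; the function name apply_writemask16 and the explicit zeroing flag are assumptions made for this example (in the encoding, the choice between merging and zeroing is part of the instruction, shown as {z} in the instruction syntax). The input tmp stands for the unmasked shuffle result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the per-word writemask step.
   Bit j of k1 controls word j of the destination. */
void apply_writemask16(const uint16_t *tmp, uint16_t *dst,
                       size_t n_words, uint32_t k1, int zeroing)
{
    assert(n_words <= 32);
    for (size_t j = 0; j < n_words; j++) {
        if ((k1 >> j) & 1u) {
            dst[j] = tmp[j];   /* mask bit set: take the shuffle result */
        } else if (zeroing) {
            dst[j] = 0;        /* zeroing-masking */
        }
        /* merging-masking: dst[j] keeps its previous value */
    }
}
```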

Note: In VEX encoded versions, VEX.vvvv is reserved and must be 1111b otherwise instructions will #UD. 

Operation 

PSHUFHW (128-bit Legacy SSE version)

DEST[63:0] <- SRC[63:0]
DEST[79:64] <- (SRC >> (imm[1:0] * 16))[79:64]
DEST[95:80] <- (SRC >> (imm[3:2] * 16))[79:64]
DEST[111:96] <- (SRC >> (imm[5:4] * 16))[79:64]
DEST[127:112] <- (SRC >> (imm[7:6] * 16))[79:64]
DEST[VLMAX-1:128] (Unmodified)


VPSHUFHW (VEX.128 encoded version)

DEST[63:0] <- SRC1[63:0]
DEST[79:64] <- (SRC1 >> (imm[1:0] * 16))[79:64]
DEST[95:80] <- (SRC1 >> (imm[3:2] * 16))[79:64]
DEST[111:96] <- (SRC1 >> (imm[5:4] * 16))[79:64]
DEST[127:112] <- (SRC1 >> (imm[7:6] * 16))[79:64]
DEST[VLMAX-1:128] <- 0

VPSHUFHW (VEX.256 encoded version)

DEST[63:0] <- SRC1[63:0]
DEST[79:64] <- (SRC1 >> (imm[1:0] * 16))[79:64]
DEST[95:80] <- (SRC1 >> (imm[3:2] * 16))[79:64]
DEST[111:96] <- (SRC1 >> (imm[5:4] * 16))[79:64]
DEST[127:112] <- (SRC1 >> (imm[7:6] * 16))[79:64]
DEST[191:128] <- SRC1[191:128]
DEST[207:192] <- (SRC1 >> (imm[1:0] * 16))[207:192]
DEST[223:208] <- (SRC1 >> (imm[3:2] * 16))[207:192]
DEST[239:224] <- (SRC1 >> (imm[5:4] * 16))[207:192]
DEST[255:240] <- (SRC1 >> (imm[7:6] * 16))[207:192]
DEST[VLMAX-1:256] <- 0

VPSHUFHW (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)

IF VL >= 128
    TMP_DEST[63:0] <- SRC1[63:0]
    TMP_DEST[79:64] <- (SRC1 >> (imm[1:0] * 16))[79:64]
    TMP_DEST[95:80] <- (SRC1 >> (imm[3:2] * 16))[79:64]
    TMP_DEST[111:96] <- (SRC1 >> (imm[5:4] * 16))[79:64]
    TMP_DEST[127:112] <- (SRC1 >> (imm[7:6] * 16))[79:64]
FI;

IF VL >= 256
    TMP_DEST[191:128] <- SRC1[191:128]
    TMP_DEST[207:192] <- (SRC1 >> (imm[1:0] * 16))[207:192]
    TMP_DEST[223:208] <- (SRC1 >> (imm[3:2] * 16))[207:192]
    TMP_DEST[239:224] <- (SRC1 >> (imm[5:4] * 16))[207:192]
    TMP_DEST[255:240] <- (SRC1 >> (imm[7:6] * 16))[207:192]
FI;

IF VL >= 512
    TMP_DEST[319:256] <- SRC1[319:256]
    TMP_DEST[335:320] <- (SRC1 >> (imm[1:0] * 16))[335:320]
    TMP_DEST[351:336] <- (SRC1 >> (imm[3:2] * 16))[335:320]
    TMP_DEST[367:352] <- (SRC1 >> (imm[5:4] * 16))[335:320]
    TMP_DEST[383:368] <- (SRC1 >> (imm[7:6] * 16))[335:320]
    TMP_DEST[447:384] <- SRC1[447:384]
    TMP_DEST[463:448] <- (SRC1 >> (imm[1:0] * 16))[463:448]
    TMP_DEST[479:464] <- (SRC1 >> (imm[3:2] * 16))[463:448]
    TMP_DEST[495:480] <- (SRC1 >> (imm[5:4] * 16))[463:448]
    TMP_DEST[511:496] <- (SRC1 >> (imm[7:6] * 16))[463:448]
FI;

FOR j <- 0 TO KL-1
    i <- j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] <- TMP_DEST[i+15:i];
        ELSE
            IF *merging-masking*            ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking*      ; zeroing-masking
                    DEST[i+15:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0

Intel C/C++ Compiler Intrinsic Equivalent

VPSHUFHW __m512i _mm512_shufflehi_epi16(__m512i a, int n);

VPSHUFHW __m512i _mm512_mask_shufflehi_epi16(__m512i s, __mmask32 k, __m512i a, int n);

VPSHUFHW __m512i _mm512_maskz_shufflehi_epi16(__mmask32 k, __m512i a, int n);

VPSHUFHW __m256i _mm256_mask_shufflehi_epi16(__m256i s, __mmask16 k, __m256i a, int n);

VPSHUFHW __m256i _mm256_maskz_shufflehi_epi16(__mmask16 k, __m256i a, int n);

VPSHUFHW __m128i _mm_mask_shufflehi_epi16(__m128i s, __mmask8 k, __m128i a, int n);

VPSHUFHW __m128i _mm_maskz_shufflehi_epi16(__mmask8 k, __m128i a, int n);

(V)PSHUFHW: __m128i _mm_shufflehi_epi16(__m128i a, int n)

VPSHUFHW: __m256i _mm256_shufflehi_epi16(__m256i a, const int n)

Flags Affected 

None. 

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4.

EVEX-encoded instruction, see Exceptions Type E4NF.nb.

#UD If VEX.vvvv != 1111B, or EVEX.vvvv != 1111B.
PSHUFLW—Shuffle Packed Low Words

Opcode/                                         Op/   64/32 bit     CPUID         Description
Instruction                                     En    Mode Support  Feature Flag

F2 0F 70 /r ib                                  RMI   V/V           SSE2          Shuffle the low words in xmm2/m128 based on the
PSHUFLW xmm1, xmm2/m128, imm8                                                     encoding in imm8 and store the result in xmm1.

VEX.128.F2.0F.WIG 70 /r ib                      RMI   V/V           AVX           Shuffle the low words in xmm2/m128 based on the
VPSHUFLW xmm1, xmm2/m128, imm8                                                    encoding in imm8 and store the result in xmm1.

VEX.256.F2.0F.WIG 70 /r ib                      RMI   V/V           AVX2          Shuffle the low words in ymm2/m256 based on the
VPSHUFLW ymm1, ymm2/m256, imm8                                                    encoding in imm8 and store the result in ymm1.

EVEX.128.F2.0F.WIG 70 /r ib                     FVM   V/V           AVX512VL      Shuffle the low words in xmm2/m128 based on the encoding
VPSHUFLW xmm1 {k1}{z}, xmm2/m128, imm8                              AVX512BW      in imm8 and store the result in xmm1 under write mask k1.

EVEX.256.F2.0F.WIG 70 /r ib                     FVM   V/V           AVX512VL      Shuffle the low words in ymm2/m256 based on the encoding
VPSHUFLW ymm1 {k1}{z}, ymm2/m256, imm8                              AVX512BW      in imm8 and store the result in ymm1 under write mask k1.

EVEX.512.F2.0F.WIG 70 /r ib                     FVM   V/V           AVX512BW      Shuffle the low words in zmm2/m512 based on the encoding
VPSHUFLW zmm1 {k1}{z}, zmm2/m512, imm8                                            in imm8 and store the result in zmm1 under write mask k1.


Instruction Operand Encoding 


Op/En   Operand 1       Operand 2       Operand 3   Operand 4
RMI     ModRM:reg (w)   ModRM:r/m (r)   imm8        NA
FVM     ModRM:reg (w)   ModRM:r/m (r)   imm8        NA


Description 

Copies words from the low quadword of a 128-bit lane of the source operand and inserts them in the low quadword 
of the destination operand at word locations (of the respective lane) selected with the immediate operand. The 
256-bit operation is similar to the in-lane operation used by the 256-bit VPSHUFD instruction, which is illustrated 
in Figure 4-16. For 128-bit operation, only the low 128-bit lane is operative. Each 2-bit field in the immediate 
operand selects the contents of one word location in the low quadword of the destination operand. The binary 
encodings of the immediate operand fields select words (0,1,2 or 3) from the low quadword of the source operand 
to be copied to the destination operand. The high quadword of the source operand is copied to the high quadword 
of the destination operand, for each 128-bit lane. 

Note that this instruction permits a word in the low quadword of the source operand to be copied to more than one 
word location in the low quadword of the destination operand. 
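
The encoding of the immediate can be illustrated by building it from four 2-bit selectors and applying it to the low quadword of each lane. Both helper names below (make_imm8 and pshuflw_model) are hypothetical and exist only for this sketch; the field packing mirrors the layout described above, with the selector for word 0 in bits 1:0 and the selector for word 3 in bits 7:6.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack four word selectors (each 0..3) into an imm8.
   sel0 chooses the source for destination word 0, sel3 for word 3. */
uint8_t make_imm8(unsigned sel3, unsigned sel2, unsigned sel1, unsigned sel0)
{
    assert(sel0 < 4 && sel1 < 4 && sel2 < 4 && sel3 < 4);
    return (uint8_t)((sel3 << 6) | (sel2 << 4) | (sel1 << 2) | sel0);
}

/* Hypothetical scalar model of (V)PSHUFLW; n_words is 8, 16, or 32. */
void pshuflw_model(const uint16_t *src, uint16_t *dst,
                   size_t n_words, uint8_t imm)
{
    assert(n_words % 8 == 0);
    for (size_t lane = 0; lane < n_words; lane += 8) {
        for (size_t k = 0; k < 4; k++) {
            unsigned sel = (imm >> (2 * k)) & 3u;
            dst[lane + k] = src[lane + sel];           /* low quadword: shuffled */
            dst[lane + 4 + k] = src[lane + 4 + k];     /* high quadword: copied */
        }
    }
}
```

Passing the same selector in all four positions replicates one source word across the low quadword, which is the case the note above refers to.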

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15). 

128-bit Legacy SSE version: The destination operand is an XMM register. The source operand can be an XMM
register or a 128-bit memory location. Bits (VLMAX-1:128) of the corresponding YMM destination register remain
unchanged.

VEX.128 encoded version: The destination operand is an XMM register. The source operand can be an XMM register
or a 128-bit memory location. Bits (VLMAX-1:128) of the destination YMM register are zeroed.

VEX.256 encoded version: The destination operand is a YMM register. The source operand can be a YMM register
or a 256-bit memory location.

EVEX encoded version: The destination operand is a ZMM/YMM/XMM register. The source operand can be a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination is updated according to the
writemask.

Note: In VEX encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation 

PSHUFLW (128-bit Legacy SSE version)

DEST[15:0] <- (SRC >> (imm[1:0] * 16))[15:0]
DEST[31:16] <- (SRC >> (imm[3:2] * 16))[15:0]
DEST[47:32] <- (SRC >> (imm[5:4] * 16))[15:0]
DEST[63:48] <- (SRC >> (imm[7:6] * 16))[15:0]
DEST[127:64] <- SRC[127:64]
DEST[VLMAX-1:128] (Unmodified)


VPSHUFLW (VEX.128 encoded version)

DEST[15:0] <- (SRC1 >> (imm[1:0] * 16))[15:0]
DEST[31:16] <- (SRC1 >> (imm[3:2] * 16))[15:0]
DEST[47:32] <- (SRC1 >> (imm[5:4] * 16))[15:0]
DEST[63:48] <- (SRC1 >> (imm[7:6] * 16))[15:0]
DEST[127:64] <- SRC1[127:64]
DEST[VLMAX-1:128] <- 0

VPSHUFLW (VEX.256 encoded version)

DEST[15:0] <- (SRC1 >> (imm[1:0] * 16))[15:0]
DEST[31:16] <- (SRC1 >> (imm[3:2] * 16))[15:0]
DEST[47:32] <- (SRC1 >> (imm[5:4] * 16))[15:0]
DEST[63:48] <- (SRC1 >> (imm[7:6] * 16))[15:0]
DEST[127:64] <- SRC1[127:64]
DEST[143:128] <- (SRC1 >> (imm[1:0] * 16))[143:128]
DEST[159:144] <- (SRC1 >> (imm[3:2] * 16))[143:128]
DEST[175:160] <- (SRC1 >> (imm[5:4] * 16))[143:128]
DEST[191:176] <- (SRC1 >> (imm[7:6] * 16))[143:128]
DEST[255:192] <- SRC1[255:192]
DEST[VLMAX-1:256] <- 0


VPSHUFLW (EVEX.U1.512 encoded version) 

(KL, VL) = (8, 128), (16, 256), (32, 512) 

IF VL >= 128 
    TMP_DEST[15:0] <- (SRC1 >> (imm[1:0] * 16))[15:0] 
    TMP_DEST[31:16] <- (SRC1 >> (imm[3:2] * 16))[15:0] 
    TMP_DEST[47:32] <- (SRC1 >> (imm[5:4] * 16))[15:0] 
    TMP_DEST[63:48] <- (SRC1 >> (imm[7:6] * 16))[15:0] 
    TMP_DEST[127:64] <- SRC1[127:64] 
FI; 
IF VL >= 256 
    TMP_DEST[143:128] <- (SRC1 >> (imm[1:0] * 16))[143:128] 
    TMP_DEST[159:144] <- (SRC1 >> (imm[3:2] * 16))[143:128] 
    TMP_DEST[175:160] <- (SRC1 >> (imm[5:4] * 16))[143:128] 
    TMP_DEST[191:176] <- (SRC1 >> (imm[7:6] * 16))[143:128] 
    TMP_DEST[255:192] <- SRC1[255:192] 
FI; 
IF VL >= 512 
    TMP_DEST[271:256] <- (SRC1 >> (imm[1:0] * 16))[271:256] 
    TMP_DEST[287:272] <- (SRC1 >> (imm[3:2] * 16))[271:256] 
    TMP_DEST[303:288] <- (SRC1 >> (imm[5:4] * 16))[271:256] 
    TMP_DEST[319:304] <- (SRC1 >> (imm[7:6] * 16))[271:256] 
    TMP_DEST[383:320] <- SRC1[383:320] 
    TMP_DEST[399:384] <- (SRC1 >> (imm[1:0] * 16))[399:384] 
    TMP_DEST[415:400] <- (SRC1 >> (imm[3:2] * 16))[399:384] 
    TMP_DEST[431:416] <- (SRC1 >> (imm[5:4] * 16))[399:384] 
    TMP_DEST[447:432] <- (SRC1 >> (imm[7:6] * 16))[399:384] 
    TMP_DEST[511:448] <- SRC1[511:448] 
FI; 
FOR j <- 0 TO KL-1 
    i <- j * 16 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+15:i] <- TMP_DEST[i+15:i]; 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+15:i] remains unchanged* 
                ELSE *zeroing-masking* ; zeroing-masking 
                    DEST[i+15:i] <- 0 
            FI 
    FI; 
ENDFOR 
DEST[MAX_VL-1:VL] <- 0 

Intel C/C++ Compiler Intrinsic Equivalent 

VPSHUFLW __m512i _mm512_shufflelo_epi16(__m512i a, int n); 
VPSHUFLW __m512i _mm512_mask_shufflelo_epi16(__m512i s, __mmask16 k, __m512i a, int n); 
VPSHUFLW __m512i _mm512_maskz_shufflelo_epi16(__mmask16 k, __m512i a, int n); 
VPSHUFLW __m256i _mm256_mask_shufflelo_epi16(__m256i s, __mmask8 k, __m256i a, int n); 
VPSHUFLW __m256i _mm256_maskz_shufflelo_epi16(__mmask8 k, __m256i a, int n); 
VPSHUFLW __m128i _mm_mask_shufflelo_epi16(__m128i s, __mmask8 k, __m128i a, int n); 
VPSHUFLW __m128i _mm_maskz_shufflelo_epi16(__mmask8 k, __m128i a, int n); 
(V)PSHUFLW: __m128i _mm_shufflelo_epi16(__m128i a, int n) 
VPSHUFLW: __m256i _mm256_shufflelo_epi16(__m256i a, const int n) 

Flags Affected 

None. 

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4; 

EVEX-encoded instruction, see Exceptions Type E4NF.nb. 

#UD If VEX.vvvv != 1111B, or EVEX.vvvv != 1111B. 

PSHUFW - Shuffle Packed Words 


Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description 
0F 70 /r ib | PSHUFW mm1, mm2/m64, imm8 | RMI | Valid | Valid | Shuffle the words in mm2/m64 based on the encoding in imm8 and store the result in mm1. 


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
RMI | ModRM:reg (w) | ModRM:r/m (r) | imm8 | NA 


Description 

Copies words from the source operand (second operand) and inserts them in the destination operand (first 
operand) at word locations selected with the order operand (third operand). This operation is similar to the 
operation used by the PSHUFD instruction, which is illustrated in Figure 4-16. For the PSHUFW instruction, each 2-bit 
field in the order operand selects the contents of one word location in the destination operand. The encodings of the 
order operand fields select words from the source operand to be copied to the destination operand. 

The source operand can be an MMX technology register or a 64-bit memory location. The destination operand is an 
MMX technology register. The order operand is an 8-bit immediate. Note that this instruction permits a word in the 
source operand to be copied to more than one word location in the destination operand. 

In 64-bit mode, using a REX prefix in the form of REX.R permits this instruction to access additional registers 
(XMM8-XMM15). 

Operation 

DEST[15:0] <- (SRC >> (ORDER[1:0] * 16))[15:0]; 
DEST[31:16] <- (SRC >> (ORDER[3:2] * 16))[15:0]; 
DEST[47:32] <- (SRC >> (ORDER[5:4] * 16))[15:0]; 
DEST[63:48] <- (SRC >> (ORDER[7:6] * 16))[15:0]; 
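
The four assignments above can be traced with a short scalar model. This is only a sketch: `make_order` and `pshufw` are hypothetical helper names, and the 64-bit MMX operand is represented as a Python integer.

```python
def make_order(w3: int, w2: int, w1: int, w0: int) -> int:
    """Pack four 2-bit word selectors into an order byte (hypothetical helper).

    w0 selects the source word for DEST[15:0], w3 the one for DEST[63:48].
    """
    return (w3 << 6) | (w2 << 4) | (w1 << 2) | w0


def pshufw(src: int, order: int) -> int:
    """Model PSHUFW on a 64-bit value held in a Python int."""
    dest = 0
    for k in range(4):
        sel = (order >> (2 * k)) & 0x3                      # ORDER[2k+1:2k]
        dest |= ((src >> (16 * sel)) & 0xFFFF) << (16 * k)  # copy the selected word
    return dest


src = 0xDDDD_CCCC_BBBB_AAAA
order = make_order(2, 3, 0, 1)      # swap the two words inside each doubleword
swapped = pshufw(src, order)
```

An order byte of E4H (selectors 3, 2, 1, 0) leaves the operand unchanged, which is a convenient sanity check for any model of this kind.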

Intel C/C++ Compiler Intrinsic Equivalent 

PSHUFW: __m64 _mm_shuffle_pi16(__m64 a, int n) 

Flags Affected 

None. 

Numeric Exceptions 

None. 

Other Exceptions 

See Table 22-7, "Exception Conditions for SIMD/MMX Instructions with Memory Reference," in the Intel® 64 and 
IA-32 Architectures Software Developer's Manual, Volume 3A. 

PSIGNB/PSIGNW/PSIGND - Packed SIGN 


Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description 
0F 38 08 /r¹ | PSIGNB mm1, mm2/m64 | RM | V/V | SSSE3 | Negate/zero/preserve packed byte integers in mm1 depending on the corresponding sign in mm2/m64. 
66 0F 38 08 /r | PSIGNB xmm1, xmm2/m128 | RM | V/V | SSSE3 | Negate/zero/preserve packed byte integers in xmm1 depending on the corresponding sign in xmm2/m128. 
0F 38 09 /r¹ | PSIGNW mm1, mm2/m64 | RM | V/V | SSSE3 | Negate/zero/preserve packed word integers in mm1 depending on the corresponding sign in mm2/m64. 
66 0F 38 09 /r | PSIGNW xmm1, xmm2/m128 | RM | V/V | SSSE3 | Negate/zero/preserve packed word integers in xmm1 depending on the corresponding sign in xmm2/m128. 
0F 38 0A /r¹ | PSIGND mm1, mm2/m64 | RM | V/V | SSSE3 | Negate/zero/preserve packed doubleword integers in mm1 depending on the corresponding sign in mm2/m64. 
66 0F 38 0A /r | PSIGND xmm1, xmm2/m128 | RM | V/V | SSSE3 | Negate/zero/preserve packed doubleword integers in xmm1 depending on the corresponding sign in xmm2/m128. 
VEX.NDS.128.66.0F38.WIG 08 /r | VPSIGNB xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Negate/zero/preserve packed byte integers in xmm2 depending on the corresponding sign in xmm3/m128. 
VEX.NDS.128.66.0F38.WIG 09 /r | VPSIGNW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Negate/zero/preserve packed word integers in xmm2 depending on the corresponding sign in xmm3/m128. 
VEX.NDS.128.66.0F38.WIG 0A /r | VPSIGND xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Negate/zero/preserve packed doubleword integers in xmm2 depending on the corresponding sign in xmm3/m128. 
VEX.NDS.256.66.0F38.WIG 08 /r | VPSIGNB ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Negate packed byte integers in ymm2 if the corresponding sign in ymm3/m256 is less than zero. 
VEX.NDS.256.66.0F38.WIG 09 /r | VPSIGNW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Negate packed 16-bit integers in ymm2 if the corresponding sign in ymm3/m256 is less than zero. 
VEX.NDS.256.66.0F38.WIG 0A /r | VPSIGND ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Negate packed doubleword integers in ymm2 if the corresponding sign in ymm3/m256 is less than zero. 

NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers" 
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. 


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA 
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA 

Description 

(V)PSIGNB/(V)PSIGNW/(V)PSIGND negates each data element of the destination operand (the first operand) if the 
signed integer value of the corresponding data element in the source operand (the second operand) is less than 
zero. If the signed integer value of a data element in the source operand is positive, the corresponding data 
element in the destination operand is unchanged. If a data element in the source operand is zero, the 
corresponding data element in the destination operand is set to zero. 
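
The three cases can be modeled element by element as follows. This is a sketch under stated assumptions: `psign` is a hypothetical helper, elements are passed as lists of unsigned bit patterns, and `bits` is 8, 16, or 32 for the byte, word, and doubleword forms.

```python
def psign(dest: list[int], src: list[int], bits: int) -> list[int]:
    """Model (V)PSIGNB/W/D on lists of element bit patterns (illustrative helper)."""
    mask = (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    out = []
    for d, s in zip(dest, src):
        s &= mask
        if s & sign_bit:              # source element is negative: negate
            out.append((-d) & mask)   # two's-complement negation, truncated to width
        elif s == 0:                  # source element is zero: clear
            out.append(0)
        else:                         # source element is positive: preserve
            out.append(d & mask)
    return out


# Byte elements: negate 5, zero 5, keep 5, and negate 80H (-128).
result = psign([0x05, 0x05, 0x05, 0x80], [0xFF, 0x00, 0x01, 0xFF], 8)
```

Because the negation is truncated to the element width, negating the most negative value (80H for bytes) yields the same bit pattern again.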

(V)PSIGNB operates on signed bytes. (V)PSIGNW operates on 16-bit signed words. (V)PSIGND operates on signed 
32-bit integers. When the source operand is a 128-bit memory operand, the operand must be aligned on a 16-byte 
boundary or a general-protection exception (#GP) will be generated. 

Legacy SSE instructions: Both operands can be MMX registers. In 64-bit mode, use the REX prefix to access 
additional registers. 

128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source 
operand is an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the corresponding YMM 
destination register remain unchanged. 

VEX.128 encoded version: The first source and destination operands are XMM registers. The second source 
operand is an XMM register or a 128-bit memory location. Bits (VLMAX-1:128) of the destination YMM register are 
zeroed. VEX.L must be 0, otherwise instructions will #UD. 

VEX.256 encoded version: The first source and destination operands are YMM registers. The second source 
operand is a YMM register or a 256-bit memory location. 

Operation 

PSIGNB (with 64 bit operands) 

IF (SRC[7:0] < 0) 
    DEST[7:0] <- Neg(DEST[7:0]) 
ELSEIF (SRC[7:0] = 0) 
    DEST[7:0] <- 0 
ELSEIF (SRC[7:0] > 0) 
    DEST[7:0] <- DEST[7:0] 
Repeat operation for 2nd through 7th bytes 
IF (SRC[63:56] < 0) 
    DEST[63:56] <- Neg(DEST[63:56]) 
ELSEIF (SRC[63:56] = 0) 
    DEST[63:56] <- 0 
ELSEIF (SRC[63:56] > 0) 
    DEST[63:56] <- DEST[63:56] 

PSIGNB (with 128 bit operands) 

IF (SRC[7:0] < 0) 
    DEST[7:0] <- Neg(DEST[7:0]) 
ELSEIF (SRC[7:0] = 0) 
    DEST[7:0] <- 0 
ELSEIF (SRC[7:0] > 0) 
    DEST[7:0] <- DEST[7:0] 
Repeat operation for 2nd through 15th bytes 
IF (SRC[127:120] < 0) 
    DEST[127:120] <- Neg(DEST[127:120]) 
ELSEIF (SRC[127:120] = 0) 
    DEST[127:120] <- 0 
ELSEIF (SRC[127:120] > 0) 
    DEST[127:120] <- DEST[127:120] 

VPSIGNB (VEX.128 encoded version) 

DEST[127:0] <- BYTE_SIGN(SRC1, SRC2) 
DEST[VLMAX-1:128] <- 0 

VPSIGNB (VEX.256 encoded version) 

DEST[255:0] <- BYTE_SIGN_256b(SRC1, SRC2) 

PSIGNW (with 64 bit operands) 

IF (SRC[15:0] < 0) 
    DEST[15:0] <- Neg(DEST[15:0]) 
ELSEIF (SRC[15:0] = 0) 
    DEST[15:0] <- 0 
ELSEIF (SRC[15:0] > 0) 
    DEST[15:0] <- DEST[15:0] 
Repeat operation for 2nd through 3rd words 
IF (SRC[63:48] < 0) 
    DEST[63:48] <- Neg(DEST[63:48]) 
ELSEIF (SRC[63:48] = 0) 
    DEST[63:48] <- 0 
ELSEIF (SRC[63:48] > 0) 
    DEST[63:48] <- DEST[63:48] 

PSIGNW (with 128 bit operands) 

IF (SRC[15:0] < 0) 
    DEST[15:0] <- Neg(DEST[15:0]) 
ELSEIF (SRC[15:0] = 0) 
    DEST[15:0] <- 0 
ELSEIF (SRC[15:0] > 0) 
    DEST[15:0] <- DEST[15:0] 
Repeat operation for 2nd through 7th words 
IF (SRC[127:112] < 0) 
    DEST[127:112] <- Neg(DEST[127:112]) 
ELSEIF (SRC[127:112] = 0) 
    DEST[127:112] <- 0 
ELSEIF (SRC[127:112] > 0) 
    DEST[127:112] <- DEST[127:112] 

VPSIGNW (VEX.128 encoded version) 

DEST[127:0] <- WORD_SIGN(SRC1, SRC2) 
DEST[VLMAX-1:128] <- 0 

VPSIGNW (VEX.256 encoded version) 

DEST[255:0] <- WORD_SIGN(SRC1, SRC2) 

PSIGND (with 64 bit operands) 

IF (SRC[31:0] < 0) 
    DEST[31:0] <- Neg(DEST[31:0]) 
ELSEIF (SRC[31:0] = 0) 
    DEST[31:0] <- 0 
ELSEIF (SRC[31:0] > 0) 
    DEST[31:0] <- DEST[31:0] 
IF (SRC[63:32] < 0) 
    DEST[63:32] <- Neg(DEST[63:32]) 
ELSEIF (SRC[63:32] = 0) 
    DEST[63:32] <- 0 
ELSEIF (SRC[63:32] > 0) 
    DEST[63:32] <- DEST[63:32] 

PSIGND (with 128 bit operands) 

IF (SRC[31:0] < 0) 
    DEST[31:0] <- Neg(DEST[31:0]) 
ELSEIF (SRC[31:0] = 0) 
    DEST[31:0] <- 0 
ELSEIF (SRC[31:0] > 0) 
    DEST[31:0] <- DEST[31:0] 
Repeat operation for 2nd through 3rd double words 
IF (SRC[127:96] < 0) 
    DEST[127:96] <- Neg(DEST[127:96]) 
ELSEIF (SRC[127:96] = 0) 
    DEST[127:96] <- 0 
ELSEIF (SRC[127:96] > 0) 
    DEST[127:96] <- DEST[127:96] 

VPSIGND (VEX.128 encoded version) 

DEST[127:0] <- DWORD_SIGN(SRC1, SRC2) 
DEST[VLMAX-1:128] <- 0 

VPSIGND (VEX.256 encoded version) 

DEST[255:0] <- DWORD_SIGN(SRC1, SRC2) 


Intel C/C++ Compiler Intrinsic Equivalent 


PSIGNB: __m64 _mm_sign_pi8 (__m64 a, __m64 b) 
(V)PSIGNB: __m128i _mm_sign_epi8 (__m128i a, __m128i b) 
VPSIGNB: __m256i _mm256_sign_epi8 (__m256i a, __m256i b) 
PSIGNW: __m64 _mm_sign_pi16 (__m64 a, __m64 b) 
(V)PSIGNW: __m128i _mm_sign_epi16 (__m128i a, __m128i b) 
VPSIGNW: __m256i _mm256_sign_epi16 (__m256i a, __m256i b) 
PSIGND: __m64 _mm_sign_pi32 (__m64 a, __m64 b) 
(V)PSIGND: __m128i _mm_sign_epi32 (__m128i a, __m128i b) 
VPSIGND: __m256i _mm256_sign_epi32 (__m256i a, __m256i b) 


SIMD Floating-Point Exceptions 

None. 


Other Exceptions 

See Exceptions Type 4; additionally 
#UD If VEX.L = 1. 

PSLLDQ - Shift Double Quadword Left Logical 


Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description 
66 0F 73 /7 ib | PSLLDQ xmm1, imm8 | MI | V/V | SSE2 | Shift xmm1 left by imm8 bytes while shifting in 0s. 
VEX.NDD.128.66.0F.WIG 73 /7 ib | VPSLLDQ xmm1, xmm2, imm8 | VMI | V/V | AVX | Shift xmm2 left by imm8 bytes while shifting in 0s and store result in xmm1. 
VEX.NDD.256.66.0F.WIG 73 /7 ib | VPSLLDQ ymm1, ymm2, imm8 | VMI | V/V | AVX2 | Shift ymm2 left by imm8 bytes while shifting in 0s and store result in ymm1. 
EVEX.NDD.128.66.0F.WIG 73 /7 ib | VPSLLDQ xmm1, xmm2/m128, imm8 | FVMI | V/V | AVX512VL AVX512BW | Shift xmm2/m128 left by imm8 bytes while shifting in 0s and store result in xmm1. 
EVEX.NDD.256.66.0F.WIG 73 /7 ib | VPSLLDQ ymm1, ymm2/m256, imm8 | FVMI | V/V | AVX512VL AVX512BW | Shift ymm2/m256 left by imm8 bytes while shifting in 0s and store result in ymm1. 
EVEX.NDD.512.66.0F.WIG 73 /7 ib | VPSLLDQ zmm1, zmm2/m512, imm8 | FVMI | V/V | AVX512BW | Shift zmm2/m512 left by imm8 bytes while shifting in 0s and store result in zmm1. 


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
MI | ModRM:r/m (r, w) | imm8 | NA | NA 
VMI | VEX.vvvv (w) | ModRM:r/m (r) | imm8 | NA 
FVMI | EVEX.vvvv (w) | ModRM:r/m (R) | imm8 | NA 


Description 

Shifts the destination operand (first operand) to the left by the number of bytes specified in the count operand 
(second operand). The empty low-order bytes are cleared (set to all 0s). If the value specified by the count 
operand is greater than 15, the destination operand is set to all 0s. The count operand is an 8-bit immediate. 
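
A scalar model makes the byte-granular shift and the count clamp concrete. This is a sketch only: `pslldq_lane` and `vpslldq` are hypothetical helper names, registers are held in Python integers, and the per-lane form assumes the count is applied independently to each 128-bit lane, as stated for the 256-bit and 512-bit encodings.

```python
LANE_MASK = (1 << 128) - 1


def pslldq_lane(lane: int, count: int) -> int:
    """Shift one 128-bit lane left by `count` bytes, filling with zeros."""
    if count > 15:
        count = 16                       # any count above 15 clears the lane
    return (lane << (8 * count)) & LANE_MASK


def vpslldq(value: int, count: int, lanes: int) -> int:
    """Apply the same byte shift to each 128-bit lane independently."""
    out = 0
    for n in range(lanes):
        lane = (value >> (128 * n)) & LANE_MASK
        out |= pslldq_lane(lane, count) << (128 * n)
    return out


one_byte = pslldq_lane(0x0123, 1)        # moves the value up by one byte
cleared = pslldq_lane(LANE_MASK, 200)    # count > 15: result is all zeros
# Two lanes: the top byte of the low lane is shifted out, not carried upward.
two_lanes = vpslldq((0xAA << 128) | (0xFF << 120), 1, 2)
```

The two-lane case shows that bytes shifted out of the low lane do not enter the high lane.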

128-bit Legacy SSE version: The source and destination operands are the same. Bits (VLMAX-1:128) of the 
corresponding YMM destination register remain unchanged. 

VEX.128 encoded version: The source and destination operands are XMM registers. Bits (VLMAX-1:128) of the 
destination YMM register are zeroed. 

VEX.256 encoded version: The source operand is a YMM register. The destination operand is a YMM register. Bits 
(MAX_VL-1:256) of the corresponding ZMM register are zeroed. The count operand applies to both the low and 
high 128-bit lanes. 

EVEX encoded versions: The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. 
The destination operand is a ZMM/YMM/XMM register. The count operand applies to each 128-bit lane. 

Operation 

VPSLLDQ (EVEX.U1.512 encoded version) 

TEMP <- COUNT 
IF (TEMP > 15) THEN TEMP <- 16; FI 
DEST[127:0] <- SRC[127:0] << (TEMP * 8) 
DEST[255:128] <- SRC[255:128] << (TEMP * 8) 
DEST[383:256] <- SRC[383:256] << (TEMP * 8) 
DEST[511:384] <- SRC[511:384] << (TEMP * 8) 
DEST[MAX_VL-1:512] <- 0 

VPSLLDQ (VEX.256 and EVEX.256 encoded version) 

TEMP <- COUNT 
IF (TEMP > 15) THEN TEMP <- 16; FI 
DEST[127:0] <- SRC[127:0] << (TEMP * 8) 
DEST[255:128] <- SRC[255:128] << (TEMP * 8) 
DEST[MAX_VL-1:256] <- 0 

VPSLLDQ (VEX.128 and EVEX.128 encoded version) 

TEMP <- COUNT 
IF (TEMP > 15) THEN TEMP <- 16; FI 
DEST <- SRC << (TEMP * 8) 
DEST[MAX_VL-1:128] <- 0 

PSLLDQ (128-bit Legacy SSE version) 

TEMP <- COUNT 
IF (TEMP > 15) THEN TEMP <- 16; FI 
DEST <- DEST << (TEMP * 8) 
DEST[VLMAX-1:128] (Unmodified) 

Intel C/C++ Compiler Intrinsic Equivalent 

(V)PSLLDQ: __m128i _mm_slli_si128 (__m128i a, int imm) 
VPSLLDQ: __m256i _mm256_slli_si256 (__m256i a, const int imm) 
VPSLLDQ: __m512i _mm512_bslli_epi128 (__m512i a, const int imm) 

Flags Affected 

None. 

Numeric Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 7. 
EVEX-encoded instruction, see Exceptions Type E4NF.nb. 



PSLLW/PSLLD/PSLLQ-Shift Packed Data Left Logical 


Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description 
0F F1 /r¹ | PSLLW mm, mm/m64 | RM | V/V | MMX | Shift words in mm left by mm/m64 while shifting in 0s. 
66 0F F1 /r | PSLLW xmm1, xmm2/m128 | RM | V/V | SSE2 | Shift words in xmm1 left by xmm2/m128 while shifting in 0s. 
0F 71 /6 ib | PSLLW mm1, imm8 | MI | V/V | MMX | Shift words in mm left by imm8 while shifting in 0s. 
66 0F 71 /6 ib | PSLLW xmm1, imm8 | MI | V/V | SSE2 | Shift words in xmm1 left by imm8 while shifting in 0s. 
0F F2 /r¹ | PSLLD mm, mm/m64 | RM | V/V | MMX | Shift doublewords in mm left by mm/m64 while shifting in 0s. 
66 0F F2 /r | PSLLD xmm1, xmm2/m128 | RM | V/V | SSE2 | Shift doublewords in xmm1 left by xmm2/m128 while shifting in 0s. 
0F 72 /6 ib¹ | PSLLD mm, imm8 | MI | V/V | MMX | Shift doublewords in mm left by imm8 while shifting in 0s. 
66 0F 72 /6 ib | PSLLD xmm1, imm8 | MI | V/V | SSE2 | Shift doublewords in xmm1 left by imm8 while shifting in 0s. 
0F F3 /r¹ | PSLLQ mm, mm/m64 | RM | V/V | MMX | Shift quadword in mm left by mm/m64 while shifting in 0s. 
66 0F F3 /r | PSLLQ xmm1, xmm2/m128 | RM | V/V | SSE2 | Shift quadwords in xmm1 left by xmm2/m128 while shifting in 0s. 
0F 73 /6 ib¹ | PSLLQ mm, imm8 | MI | V/V | MMX | Shift quadword in mm left by imm8 while shifting in 0s. 
66 0F 73 /6 ib | PSLLQ xmm1, imm8 | MI | V/V | SSE2 | Shift quadwords in xmm1 left by imm8 while shifting in 0s. 
VEX.NDS.128.66.0F.WIG F1 /r | VPSLLW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Shift words in xmm2 left by amount specified in xmm3/m128 while shifting in 0s. 
VEX.NDD.128.66.0F.WIG 71 /6 ib | VPSLLW xmm1, xmm2, imm8 | VMI | V/V | AVX | Shift words in xmm2 left by imm8 while shifting in 0s. 
VEX.NDS.128.66.0F.WIG F2 /r | VPSLLD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Shift doublewords in xmm2 left by amount specified in xmm3/m128 while shifting in 0s. 
VEX.NDD.128.66.0F.WIG 72 /6 ib | VPSLLD xmm1, xmm2, imm8 | VMI | V/V | AVX | Shift doublewords in xmm2 left by imm8 while shifting in 0s. 
VEX.NDS.128.66.0F.WIG F3 /r | VPSLLQ xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Shift quadwords in xmm2 left by amount specified in xmm3/m128 while shifting in 0s. 
VEX.NDD.128.66.0F.WIG 73 /6 ib | VPSLLQ xmm1, xmm2, imm8 | VMI | V/V | AVX | Shift quadwords in xmm2 left by imm8 while shifting in 0s. 
VEX.NDS.256.66.0F.WIG F1 /r | VPSLLW ymm1, ymm2, xmm3/m128 | RVM | V/V | AVX2 | Shift words in ymm2 left by amount specified in xmm3/m128 while shifting in 0s. 
VEX.NDD.256.66.0F.WIG 71 /6 ib | VPSLLW ymm1, ymm2, imm8 | VMI | V/V | AVX2 | Shift words in ymm2 left by imm8 while shifting in 0s. 
VEX.NDS.256.66.0F.WIG F2 /r | VPSLLD ymm1, ymm2, xmm3/m128 | RVM | V/V | AVX2 | Shift doublewords in ymm2 left by amount specified in xmm3/m128 while shifting in 0s. 
VEX.NDD.256.66.0F.WIG 72 /6 ib | VPSLLD ymm1, ymm2, imm8 | VMI | V/V | AVX2 | Shift doublewords in ymm2 left by imm8 while shifting in 0s. 
VEX.NDS.256.66.0F.WIG F3 /r | VPSLLQ ymm1, ymm2, xmm3/m128 | RVM | V/V | AVX2 | Shift quadwords in ymm2 left by amount specified in xmm3/m128 while shifting in 0s. 
VEX.NDD.256.66.0F.WIG 73 /6 ib | VPSLLQ ymm1, ymm2, imm8 | VMI | V/V | AVX2 | Shift quadwords in ymm2 left by imm8 while shifting in 0s. 
EVEX.NDS.128.66.0F.WIG F1 /r | VPSLLW xmm1 {k1}{z}, xmm2, xmm3/m128 | M128 | V/V | AVX512VL AVX512BW | Shift words in xmm2 left by amount specified in xmm3/m128 while shifting in 0s using writemask k1. 
EVEX.NDS.256.66.0F.WIG F1 /r | VPSLLW ymm1 {k1}{z}, ymm2, xmm3/m128 | M128 | V/V | AVX512VL AVX512BW | Shift words in ymm2 left by amount specified in xmm3/m128 while shifting in 0s using writemask k1. 
EVEX.NDS.512.66.0F.WIG F1 /r | VPSLLW zmm1 {k1}{z}, zmm2, xmm3/m128 | M128 | V/V | AVX512BW | Shift words in zmm2 left by amount specified in xmm3/m128 while shifting in 0s using writemask k1. 
EVEX.NDD.128.66.0F.WIG 71 /6 ib | VPSLLW xmm1 {k1}{z}, xmm2/m128, imm8 | FVMI | V/V | AVX512VL AVX512BW | Shift words in xmm2/m128 left by imm8 while shifting in 0s using writemask k1. 
EVEX.NDD.256.66.0F.WIG 71 /6 ib | VPSLLW ymm1 {k1}{z}, ymm2/m256, imm8 | FVMI | V/V | AVX512VL AVX512BW | Shift words in ymm2/m256 left by imm8 while shifting in 0s using writemask k1. 
EVEX.NDD.512.66.0F.WIG 71 /6 ib | VPSLLW zmm1 {k1}{z}, zmm2/m512, imm8 | FVMI | V/V | AVX512BW | Shift words in zmm2/m512 left by imm8 while shifting in 0s using writemask k1. 
EVEX.NDS.128.66.0F.W0 F2 /r | VPSLLD xmm1 {k1}{z}, xmm2, xmm3/m128 | M128 | V/V | AVX512VL AVX512F | Shift doublewords in xmm2 left by amount specified in xmm3/m128 while shifting in 0s under writemask k1. 
EVEX.NDS.256.66.0F.W0 F2 /r | VPSLLD ymm1 {k1}{z}, ymm2, xmm3/m128 | M128 | V/V | AVX512VL AVX512F | Shift doublewords in ymm2 left by amount specified in xmm3/m128 while shifting in 0s under writemask k1. 
EVEX.NDS.512.66.0F.W0 F2 /r | VPSLLD zmm1 {k1}{z}, zmm2, xmm3/m128 | M128 | V/V | AVX512F | Shift doublewords in zmm2 left by amount specified in xmm3/m128 while shifting in 0s under writemask k1. 
EVEX.NDD.128.66.0F.W0 72 /6 ib | VPSLLD xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8 | FVI | V/V | AVX512VL AVX512F | Shift doublewords in xmm2/m128/m32bcst left by imm8 while shifting in 0s using writemask k1. 
EVEX.NDD.256.66.0F.W0 72 /6 ib | VPSLLD ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8 | FVI | V/V | AVX512VL AVX512F | Shift doublewords in ymm2/m256/m32bcst left by imm8 while shifting in 0s using writemask k1. 
EVEX.NDD.512.66.0F.W0 72 /6 ib | VPSLLD zmm1 {k1}{z}, zmm2/m512/m32bcst, imm8 | FVI | V/V | AVX512F | Shift doublewords in zmm2/m512/m32bcst left by imm8 while shifting in 0s using writemask k1. 
EVEX.NDS.128.66.0F.W1 F3 /r | VPSLLQ xmm1 {k1}{z}, xmm2, xmm3/m128 | M128 | V/V | AVX512VL AVX512F | Shift quadwords in xmm2 left by amount specified in xmm3/m128 while shifting in 0s using writemask k1. 
EVEX.NDS.256.66.0F.W1 F3 /r | VPSLLQ ymm1 {k1}{z}, ymm2, xmm3/m128 | M128 | V/V | AVX512VL AVX512F | Shift quadwords in ymm2 left by amount specified in xmm3/m128 while shifting in 0s using writemask k1. 
EVEX.NDS.512.66.0F.W1 F3 /r | VPSLLQ zmm1 {k1}{z}, zmm2, xmm3/m128 | M128 | V/V | AVX512F | Shift quadwords in zmm2 left by amount specified in xmm3/m128 while shifting in 0s using writemask k1. 
EVEX.NDD.128.66.0F.W1 73 /6 ib | VPSLLQ xmm1 {k1}{z}, xmm2/m128/m64bcst, imm8 | FVI | V/V | AVX512VL AVX512F | Shift quadwords in xmm2/m128/m64bcst left by imm8 while shifting in 0s using writemask k1. 
EVEX.NDD.256.66.0F.W1 73 /6 ib | VPSLLQ ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8 | FVI | V/V | AVX512VL AVX512F | Shift quadwords in ymm2/m256/m64bcst left by imm8 while shifting in 0s using writemask k1. 
EVEX.NDD.512.66.0F.W1 73 /6 ib | VPSLLQ zmm1 {k1}{z}, zmm2/m512/m64bcst, imm8 | FVI | V/V | AVX512F | Shift quadwords in zmm2/m512/m64bcst left by imm8 while shifting in 0s using writemask k1. 


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers" 
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. 


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA 
MI | ModRM:r/m (r, w) | imm8 | NA | NA 
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA 
VMI | VEX.vvvv (w) | ModRM:r/m (r) | imm8 | NA 
FVMI | EVEX.vvvv (w) | ModRM:r/m (R) | imm8 | NA 
FVI | EVEX.vvvv (w) | ModRM:r/m (R) | imm8 | NA 
M128 | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA 


Description 

Shifts the bits in the individual data elements (words, doublewords, or quadword) in the destination operand (first 
operand) to the left by the number of bits specified in the count operand (second operand). As the bits in the data 
elements are shifted left, the empty low-order bits are cleared (set to 0). If the value specified by the count 
operand is greater than 15 (for words), 31 (for doublewords), or 63 (for a quadword), then the destination operand 
is set to all 0s. Figure 4-17 gives an example of shifting words in a 64-bit operand. 



Figure 4-17. PSLLW, PSLLD, and PSLLQ Instruction Operation Using 64-bit Operand 


The (V)PSLLW instruction shifts each of the words in the destination operand to the left by the number of bits 
specified in the count operand; the (V)PSLLD instruction shifts each of the doublewords in the destination operand; and 
the (V)PSLLQ instruction shifts the quadword (or quadwords) in the destination operand. 
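
The difference between the three element widths can be seen in a small scalar model. This is a sketch, not the hardware definition: `psll` is a hypothetical helper, the operand is held in a Python integer, and `elem_bits` of 16, 32, or 64 stands in for the word, doubleword, and quadword forms.

```python
def psll(value: int, count: int, elem_bits: int, total_bits: int) -> int:
    """Shift each elem_bits-wide element of `value` left by `count` bits."""
    if count > elem_bits - 1:
        return 0                                   # count too large: all elements cleared
    mask = (1 << elem_bits) - 1
    out = 0
    for pos in range(0, total_bits, elem_bits):
        elem = (value >> pos) & mask
        out |= ((elem << count) & mask) << pos     # bits leaving an element are dropped
    return out


x = 0x8001_F002_0003_0004
as_words = psll(x, 4, 16, 64)    # each word shifted on its own
as_qword = psll(x, 4, 64, 64)    # the whole quadword shifted as one element
```

In the word form the high nibble of each word is discarded, while in the quadword form the same bits move into the neighboring 16-bit position.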

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15). 

Legacy SSE instructions 64-bit operand: The destination operand is an MMX technology register; the count 
operand can be either an MMX technology register or a 64-bit memory location. 

128-bit Legacy SSE version: The destination and first source operands are XMM registers. Bits (VLMAX-1:128) of 
the corresponding YMM destination register remain unchanged. The count operand can be either an XMM register 
or a 128-bit memory location or an 8-bit immediate. If the count operand is a memory address, 128 bits are loaded 
but the upper 64 bits are ignored. 

VEX.128 encoded version: The destination and first source operands are XMM registers. Bits (VLMAX-1:128) of the 
destination YMM register are zeroed. The count operand can be either an XMM register or a 128-bit memory 
location or an 8-bit immediate. If the count operand is a memory address, 128 bits are loaded but the upper 64 bits are 
ignored. 

VEX.256 encoded version: The destination operand is a YMM register. The source operand is a YMM register or a 
memory location. The count operand can come either from an XMM register or a memory location or an 8-bit 
immediate. Bits (MAX_VL-1:256) of the corresponding ZMM register are zeroed. 

EVEX encoded versions: The destination operand is a ZMM register updated according to the writemask. The count 
operand is either an 8-bit immediate (the immediate count version) or an 8-bit value from an XMM register or a 
memory location (the variable count version). For the immediate count version, the source operand (the second 
operand) can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 32/64-bit 
memory location. For the variable count version, the first source operand (the second operand) is a ZMM register, 
the second source operand (the third operand, 8-bit variable count) can be an XMM register or a memory location. 

Note: In VEX/EVEX encoded versions of shifts with an immediate count, vvvv of VEX/EVEX encodes the destination 
register, and VEX.B/EVEX.B + ModRM.r/m encodes the source register. 

Note: For shifts with an immediate count (VEX.128.66.0F 71-73 /6, or EVEX.128.66.0F 71-73 /6), 
VEX.vvvv/EVEX.vvvv encodes the destination register. 

Operation 

PSLLW (with 64-bit operand) 

IF (COUNT > 15) 
THEN 
    DEST[63:0] <- 0000000000000000H; 
ELSE 
    DEST[15:0] <- ZeroExtend(DEST[15:0] << COUNT); 
    (* Repeat shift operation for 2nd and 3rd words *) 
    DEST[63:48] <- ZeroExtend(DEST[63:48] << COUNT); 
FI; 

PSLLD (with 64-bit operand) 

IF (COUNT > 31) 
THEN 
    DEST[63:0] <- 0000000000000000H; 
ELSE 
    DEST[31:0] <- ZeroExtend(DEST[31:0] << COUNT); 
    DEST[63:32] <- ZeroExtend(DEST[63:32] << COUNT); 
FI; 

PSLLQ (with 64-bit operand) 

IF (COUNT > 63) 
THEN 
    DEST[63:0] <- 0000000000000000H; 
ELSE 
    DEST <- ZeroExtend(DEST << COUNT); 
FI; 

LOGICAL_LEFT_SHIFT_WORDS(SRC, COUNT_SRC) 

COUNT <- COUNT_SRC[63:0]; 
IF (COUNT > 15) 
THEN 
    DEST[127:0] <- 00000000000000000000000000000000H 
ELSE 
    DEST[15:0] <- ZeroExtend(SRC[15:0] << COUNT); 
    (* Repeat shift operation for 2nd through 7th words *) 
    DEST[127:112] <- ZeroExtend(SRC[127:112] << COUNT); 
FI; 
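
The role of COUNT_SRC[63:0] in the helper above can be checked with a direct transcription. This is a sketch under stated assumptions: the function name mirrors the pseudocode but the Python form is illustrative, and both operands are 128-bit values held in Python integers.

```python
def logical_left_shift_words(src: int, count_src: int) -> int:
    """Transcription of LOGICAL_LEFT_SHIFT_WORDS for 128-bit operands."""
    count = count_src & 0xFFFF_FFFF_FFFF_FFFF      # COUNT <- COUNT_SRC[63:0]
    if count > 15:
        return 0                                   # all eight words cleared
    out = 0
    for pos in range(0, 128, 16):
        word = (src >> pos) & 0xFFFF
        out |= ((word << count) & 0xFFFF) << pos
    return out


ones = int("0001" * 8, 16)                         # eight words, each equal to 1
# Upper 64 bits of the count operand are ignored: this shifts by 1.
shifted = logical_left_shift_words(ones, (0xDEAD << 64) | 1)
# The full low quadword is compared against 15, so bit 32 alone clears the result.
cleared = logical_left_shift_words(ones, 1 << 32)
```

The second call illustrates that the count is not taken modulo the element size; any set bit above the low four in the low quadword produces zero.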

LOGICAL_LEFT_SHIFT_DWORDS1(SRC, COUNT_SRC) 

COUNT <- COUNT_SRC[63:0]; 
IF (COUNT > 31) 
THEN 
    DEST[31:0] <- 0 
ELSE 
    DEST[31:0] <- ZeroExtend(SRC[31:0] << COUNT); 
FI; 

LOGICAL_LEFT_SHIFT_DWORDS(SRC, COUNT_SRC) 

COUNT <- COUNT_SRC[63:0]; 
IF (COUNT > 31) 
THEN 
    DEST[127:0] <- 00000000000000000000000000000000H 
ELSE 
    DEST[31:0] <- ZeroExtend(SRC[31:0] << COUNT); 
    (* Repeat shift operation for 2nd through 3rd doublewords *) 
    DEST[127:96] <- ZeroExtend(SRC[127:96] << COUNT); 
FI; 

LOGICAL_LEFT_SHIFT_QWORDS1(SRC, COUNT_SRC) 

COUNT <- COUNT_SRC[63:0]; 
IF (COUNT > 63) 
THEN 
    DEST[63:0] <- 0 
ELSE 
    DEST[63:0] <- ZeroExtend(SRC[63:0] << COUNT); 
FI; 

LOGICAL_LEFT_SHIFT_QWORDS(SRC, COUNT_SRC) 

COUNT <- COUNT_SRC[63:0]; 
IF (COUNT > 63) 
THEN 
    DEST[127:0] <- 00000000000000000000000000000000H 
ELSE 
    DEST[63:0] <- ZeroExtend(SRC[63:0] << COUNT); 
    DEST[127:64] <- ZeroExtend(SRC[127:64] << COUNT); 
FI; 

LOGICAL_LEFT_SHIFT_WORDS_256b(SRC, COUNT_SRC) 

COUNT <- COUNT_SRC[63:0]; 
IF (COUNT > 15) 
THEN 
    DEST[127:0] <- 00000000000000000000000000000000H 
    DEST[255:128] <- 00000000000000000000000000000000H 
ELSE 
    DEST[15:0] <- ZeroExtend(SRC[15:0] << COUNT); 
    (* Repeat shift operation for 2nd through 15th words *) 
    DEST[255:240] <- ZeroExtend(SRC[255:240] << COUNT); 
FI; 

LOGICAL_LEFT_SHIFT_DWORDS_256b(SRC, COUNT_SRC) 

COUNT <- COUNT_SRC[63:0]; 
IF (COUNT > 31) 
THEN 
    DEST[127:0] <- 00000000000000000000000000000000H 
    DEST[255:128] <- 00000000000000000000000000000000H 
ELSE 
    DEST[31:0] <- ZeroExtend(SRC[31:0] << COUNT); 
    (* Repeat shift operation for 2nd through 7th doublewords *) 
    DEST[255:224] <- ZeroExtend(SRC[255:224] << COUNT); 
FI; 

LOGICAL_LEFT_SHIFT_QWORDS_256b(SRC, COUNT_SRC) 

COUNT <- COUNT_SRC[63:0]; 
IF (COUNT > 63) 
THEN 
    DEST[127:0] <- 00000000000000000000000000000000H 
    DEST[255:128] <- 00000000000000000000000000000000H 
ELSE 
    DEST[63:0] <- ZeroExtend(SRC[63:0] << COUNT); 
    DEST[127:64] <- ZeroExtend(SRC[127:64] << COUNT); 
    DEST[191:128] <- ZeroExtend(SRC[191:128] << COUNT); 
    DEST[255:192] <- ZeroExtend(SRC[255:192] << COUNT); 
FI; 


VPSLLW (EVEX versions, xmm/m128)

(KL, VL) = (8, 128), (16, 256), (32, 512)

IF VL = 128
    TMP_DEST[127:0] <- LOGICAL_LEFT_SHIFT_WORDS_128b(SRC1[127:0], SRC2)
FI;
IF VL = 256
    TMP_DEST[255:0] <- LOGICAL_LEFT_SHIFT_WORDS_256b(SRC1[255:0], SRC2)
FI;
IF VL = 512
    TMP_DEST[255:0] <- LOGICAL_LEFT_SHIFT_WORDS_256b(SRC1[255:0], SRC2)
    TMP_DEST[511:256] <- LOGICAL_LEFT_SHIFT_WORDS_256b(SRC1[511:256], SRC2)
FI;

FOR j <- 0 TO KL-1
    i <- j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] <- TMP_DEST[i+15:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+15:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0

VPSLLW (EVEX versions, imm8)

(KL, VL) = (8, 128), (16, 256), (32, 512)

IF VL = 128
    TMP_DEST[127:0] <- LOGICAL_LEFT_SHIFT_WORDS_128b(SRC1[127:0], imm8)
FI;
IF VL = 256
    TMP_DEST[255:0] <- LOGICAL_LEFT_SHIFT_WORDS_256b(SRC1[255:0], imm8)
FI;
IF VL = 512
    TMP_DEST[255:0] <- LOGICAL_LEFT_SHIFT_WORDS_256b(SRC1[255:0], imm8)
    TMP_DEST[511:256] <- LOGICAL_LEFT_SHIFT_WORDS_256b(SRC1[511:256], imm8)
FI;

FOR j <- 0 TO KL-1
    i <- j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] <- TMP_DEST[i+15:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+15:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0
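
The write-back loop in the EVEX forms distinguishes three cases per element: the mask bit is set (the shifted value is written), merging-masking (the old destination element is kept), and zeroing-masking (the element is cleared). The following is a minimal scalar sketch of that loop for an immediate-count word shift. The name `masked_sll_words`, the `zeroing` flag, and the array interface are hypothetical and serve only to illustrate the pseudocode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the masked write-back of a VPSLLW-style shift.
 * kl      - number of word elements (8, 16 or 32)
 * k1      - writemask, one bit per element
 * zeroing - nonzero selects zeroing-masking, zero selects merging-masking */
void masked_sll_words(uint16_t *dest, const uint16_t *src, size_t kl,
                      uint32_t k1, unsigned count, int zeroing)
{
    assert(kl <= 32);
    for (size_t j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u) {
            /* Mask bit set: write the shifted element (0 if count > 15). */
            dest[j] = (count > 15) ? 0
                                   : (uint16_t)((uint32_t)src[j] << count);
        } else if (zeroing) {
            /* Zeroing-masking: clear the element. */
            dest[j] = 0;
        }
        /* Merging-masking: dest[j] is left unchanged. */
    }
}
```

With `k1 = 0x05` only elements 0 and 2 are updated; the remaining elements either keep their previous contents or become zero, depending on `zeroing`.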


VPSLLW (ymm, ymm, xmm/m128) - VEX.256 encoding

DEST[255:0] <- LOGICAL_LEFT_SHIFT_WORDS_256b(SRC1, SRC2)
DEST[MAX_VL-1:256] <- 0;

VPSLLW (ymm, imm8) - VEX.256 encoding

DEST[255:0] <- LOGICAL_LEFT_SHIFT_WORDS_256b(SRC1, imm8)
DEST[MAX_VL-1:256] <- 0;

VPSLLW (xmm, xmm, xmm/m128) - VEX.128 encoding

DEST[127:0] <- LOGICAL_LEFT_SHIFT_WORDS(SRC1, SRC2)
DEST[MAX_VL-1:128] <- 0

VPSLLW (xmm, imm8) - VEX.128 encoding

DEST[127:0] <- LOGICAL_LEFT_SHIFT_WORDS(SRC1, imm8)
DEST[MAX_VL-1:128] <- 0

PSLLW (xmm, xmm, xmm/m128)

DEST[127:0] <- LOGICAL_LEFT_SHIFT_WORDS(DEST, SRC)
DEST[MAX_VL-1:128] (Unmodified)

PSLLW (xmm, imm8)

DEST[127:0] <- LOGICAL_LEFT_SHIFT_WORDS(DEST, imm8)
DEST[MAX_VL-1:128] (Unmodified)

VPSLLD (EVEX versions, imm8)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j <- 0 TO KL-1
    i <- j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC1 *is memory*)
            THEN DEST[i+31:i] <- LOGICAL_LEFT_SHIFT_DWORDS1(SRC1[31:0], imm8)
            ELSE DEST[i+31:i] <- LOGICAL_LEFT_SHIFT_DWORDS1(SRC1[i+31:i], imm8)
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+31:i] <- 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0

VPSLLD (EVEX versions, xmm/m128)

(KL, VL) = (4, 128), (8, 256), (16, 512)

IF VL = 128
    TMP_DEST[127:0] <- LOGICAL_LEFT_SHIFT_DWORDS_128b(SRC1[127:0], SRC2)
FI;
IF VL = 256
    TMP_DEST[255:0] <- LOGICAL_LEFT_SHIFT_DWORDS_256b(SRC1[255:0], SRC2)
FI;
IF VL = 512
    TMP_DEST[255:0] <- LOGICAL_LEFT_SHIFT_DWORDS_256b(SRC1[255:0], SRC2)
    TMP_DEST[511:256] <- LOGICAL_LEFT_SHIFT_DWORDS_256b(SRC1[511:256], SRC2)
FI;

FOR j <- 0 TO KL-1
    i <- j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] <- TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+31:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0

VPSLLD (ymm, ymm, xmm/m128) - VEX.256 encoding

DEST[255:0] <- LOGICAL_LEFT_SHIFT_DWORDS_256b(SRC1, SRC2)
DEST[MAX_VL-1:256] <- 0;

VPSLLD (ymm, imm8) - VEX.256 encoding

DEST[255:0] <- LOGICAL_LEFT_SHIFT_DWORDS_256b(SRC1, imm8)
DEST[MAX_VL-1:256] <- 0;

VPSLLD (xmm, xmm, xmm/m128) - VEX.128 encoding

DEST[127:0] <- LOGICAL_LEFT_SHIFT_DWORDS(SRC1, SRC2)
DEST[MAX_VL-1:128] <- 0

VPSLLD (xmm, imm8) - VEX.128 encoding

DEST[127:0] <- LOGICAL_LEFT_SHIFT_DWORDS(SRC1, imm8)
DEST[MAX_VL-1:128] <- 0

PSLLD (xmm, xmm, xmm/m128)

DEST[127:0] <- LOGICAL_LEFT_SHIFT_DWORDS(DEST, SRC)
DEST[MAX_VL-1:128] (Unmodified)

PSLLD (xmm, imm8)

DEST[127:0] <- LOGICAL_LEFT_SHIFT_DWORDS(DEST, imm8)
DEST[MAX_VL-1:128] (Unmodified)

VPSLLQ (EVEX versions, imm8)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j <- 0 TO KL-1
    i <- j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC1 *is memory*)
            THEN DEST[i+63:i] <- LOGICAL_LEFT_SHIFT_QWORDS1(SRC1[63:0], imm8)
            ELSE DEST[i+63:i] <- LOGICAL_LEFT_SHIFT_QWORDS1(SRC1[i+63:i], imm8)
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+63:i] <- 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0

VPSLLQ (EVEX versions, xmm/m128)

(KL, VL) = (2, 128), (4, 256), (8, 512)

IF VL = 128
    TMP_DEST[127:0] <- LOGICAL_LEFT_SHIFT_QWORDS_128b(SRC1[127:0], SRC2)
FI;
IF VL = 256
    TMP_DEST[255:0] <- LOGICAL_LEFT_SHIFT_QWORDS_256b(SRC1[255:0], SRC2)
FI;
IF VL = 512
    TMP_DEST[255:0] <- LOGICAL_LEFT_SHIFT_QWORDS_256b(SRC1[255:0], SRC2)
    TMP_DEST[511:256] <- LOGICAL_LEFT_SHIFT_QWORDS_256b(SRC1[511:256], SRC2)
FI;

FOR j <- 0 TO KL-1
    i <- j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] <- TMP_DEST[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+63:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0

VPSLLQ (ymm, ymm, xmm/m128) - VEX.256 encoding

DEST[255:0] <- LOGICAL_LEFT_SHIFT_QWORDS_256b(SRC1, SRC2)
DEST[MAX_VL-1:256] <- 0;

VPSLLQ (ymm, imm8) - VEX.256 encoding

DEST[255:0] <- LOGICAL_LEFT_SHIFT_QWORDS_256b(SRC1, imm8)
DEST[MAX_VL-1:256] <- 0;

VPSLLQ (xmm, xmm, xmm/m128) - VEX.128 encoding

DEST[127:0] <- LOGICAL_LEFT_SHIFT_QWORDS(SRC1, SRC2)
DEST[MAX_VL-1:128] <- 0

VPSLLQ (xmm, imm8) - VEX.128 encoding

DEST[127:0] <- LOGICAL_LEFT_SHIFT_QWORDS(SRC1, imm8)
DEST[MAX_VL-1:128] <- 0

PSLLQ (xmm, xmm, xmm/m128)

DEST[127:0] <- LOGICAL_LEFT_SHIFT_QWORDS(DEST, SRC)
DEST[MAX_VL-1:128] (Unmodified)

PSLLQ (xmm, imm8)

DEST[127:0] <- LOGICAL_LEFT_SHIFT_QWORDS(DEST, imm8)
DEST[MAX_VL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalents

VPSLLD __m512i _mm512_slli_epi32(__m512i a, unsigned int imm);
VPSLLD __m512i _mm512_mask_slli_epi32(__m512i s, __mmask16 k, __m512i a, unsigned int imm);
VPSLLD __m512i _mm512_maskz_slli_epi32(__mmask16 k, __m512i a, unsigned int imm);
VPSLLD __m256i _mm256_mask_slli_epi32(__m256i s, __mmask8 k, __m256i a, unsigned int imm);
VPSLLD __m256i _mm256_maskz_slli_epi32(__mmask8 k, __m256i a, unsigned int imm);
VPSLLD __m128i _mm_mask_slli_epi32(__m128i s, __mmask8 k, __m128i a, unsigned int imm);
VPSLLD __m128i _mm_maskz_slli_epi32(__mmask8 k, __m128i a, unsigned int imm);
VPSLLD __m512i _mm512_sll_epi32(__m512i a, __m128i cnt);
VPSLLD __m512i _mm512_mask_sll_epi32(__m512i s, __mmask16 k, __m512i a, __m128i cnt);
VPSLLD __m512i _mm512_maskz_sll_epi32(__mmask16 k, __m512i a, __m128i cnt);
VPSLLD __m256i _mm256_mask_sll_epi32(__m256i s, __mmask8 k, __m256i a, __m128i cnt);
VPSLLD __m256i _mm256_maskz_sll_epi32(__mmask8 k, __m256i a, __m128i cnt);
VPSLLD __m128i _mm_mask_sll_epi32(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSLLD __m128i _mm_maskz_sll_epi32(__mmask8 k, __m128i a, __m128i cnt);
VPSLLQ __m512i _mm512_slli_epi64(__m512i a, unsigned int imm);
VPSLLQ __m512i _mm512_mask_slli_epi64(__m512i s, __mmask8 k, __m512i a, unsigned int imm);
VPSLLQ __m512i _mm512_maskz_slli_epi64(__mmask8 k, __m512i a, unsigned int imm);
VPSLLQ __m256i _mm256_mask_slli_epi64(__m256i s, __mmask8 k, __m256i a, unsigned int imm);
VPSLLQ __m256i _mm256_maskz_slli_epi64(__mmask8 k, __m256i a, unsigned int imm);
VPSLLQ __m128i _mm_mask_slli_epi64(__m128i s, __mmask8 k, __m128i a, unsigned int imm);
VPSLLQ __m128i _mm_maskz_slli_epi64(__mmask8 k, __m128i a, unsigned int imm);
VPSLLQ __m512i _mm512_sll_epi64(__m512i a, __m128i cnt);
VPSLLQ __m512i _mm512_mask_sll_epi64(__m512i s, __mmask8 k, __m512i a, __m128i cnt);
VPSLLQ __m512i _mm512_maskz_sll_epi64(__mmask8 k, __m512i a, __m128i cnt);
VPSLLQ __m256i _mm256_mask_sll_epi64(__m256i s, __mmask8 k, __m256i a, __m128i cnt);
VPSLLQ __m256i _mm256_maskz_sll_epi64(__mmask8 k, __m256i a, __m128i cnt);
VPSLLQ __m128i _mm_mask_sll_epi64(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSLLQ __m128i _mm_maskz_sll_epi64(__mmask8 k, __m128i a, __m128i cnt);
VPSLLW __m512i _mm512_slli_epi16(__m512i a, unsigned int imm);
VPSLLW __m512i _mm512_mask_slli_epi16(__m512i s, __mmask32 k, __m512i a, unsigned int imm);
VPSLLW __m512i _mm512_maskz_slli_epi16(__mmask32 k, __m512i a, unsigned int imm);
VPSLLW __m256i _mm256_mask_slli_epi16(__m256i s, __mmask16 k, __m256i a, unsigned int imm);
VPSLLW __m256i _mm256_maskz_slli_epi16(__mmask16 k, __m256i a, unsigned int imm);
VPSLLW __m128i _mm_mask_slli_epi16(__m128i s, __mmask8 k, __m128i a, unsigned int imm);
VPSLLW __m128i _mm_maskz_slli_epi16(__mmask8 k, __m128i a, unsigned int imm);
VPSLLW __m512i _mm512_sll_epi16(__m512i a, __m128i cnt);
VPSLLW __m512i _mm512_mask_sll_epi16(__m512i s, __mmask32 k, __m512i a, __m128i cnt);
VPSLLW __m512i _mm512_maskz_sll_epi16(__mmask32 k, __m512i a, __m128i cnt);
VPSLLW __m256i _mm256_mask_sll_epi16(__m256i s, __mmask16 k, __m256i a, __m128i cnt);
VPSLLW __m256i _mm256_maskz_sll_epi16(__mmask16 k, __m256i a, __m128i cnt);
VPSLLW __m128i _mm_mask_sll_epi16(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSLLW __m128i _mm_maskz_sll_epi16(__mmask8 k, __m128i a, __m128i cnt);
PSLLW: __m64 _mm_slli_pi16(__m64 m, int count)
PSLLW: __m64 _mm_sll_pi16(__m64 m, __m64 count)
(V)PSLLW: __m128i _mm_slli_epi16(__m128i m, int count)
(V)PSLLW: __m128i _mm_sll_epi16(__m128i m, __m128i count)
VPSLLW: __m256i _mm256_slli_epi16(__m256i m, int count)
VPSLLW: __m256i _mm256_sll_epi16(__m256i m, __m128i count)
PSLLD: __m64 _mm_slli_pi32(__m64 m, int count)
PSLLD: __m64 _mm_sll_pi32(__m64 m, __m64 count)
(V)PSLLD: __m128i _mm_slli_epi32(__m128i m, int count)
(V)PSLLD: __m128i _mm_sll_epi32(__m128i m, __m128i count)
VPSLLD: __m256i _mm256_slli_epi32(__m256i m, int count)
VPSLLD: __m256i _mm256_sll_epi32(__m256i m, __m128i count)
PSLLQ: __m64 _mm_slli_si64(__m64 m, int count)
PSLLQ: __m64 _mm_sll_si64(__m64 m, __m64 count)
(V)PSLLQ: __m128i _mm_slli_epi64(__m128i m, int count)
(V)PSLLQ: __m128i _mm_sll_epi64(__m128i m, __m128i count)
VPSLLQ: __m256i _mm256_slli_epi64(__m256i m, int count)
VPSLLQ: __m256i _mm256_sll_epi64(__m256i m, __m128i count)


Flags Affected 

None. 


Numeric Exceptions 

None.

Other Exceptions

VEX-encoded instructions: 

Syntax with RM/RVM operand encoding, see Exceptions Type 4. 
Syntax with MI/VMI operand encoding, see Exceptions Type 7. 

EVEX-encoded VPSLLW, see Exceptions Type E4NF.nb. 

EVEX-encoded VPSLLD/Q: 

Syntax with M128 operand encoding, see Exceptions Type E4NF.nb. 
Syntax with FVI operand encoding, see Exceptions Type E4.

PSRAW/PSRAD/PSRAQ—Shift Packed Data Right Arithmetic

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F E1 /r¹ | PSRAW mm, mm/m64 | RM | V/V | MMX | Shift words in mm right by mm/m64 while shifting in sign bits.
66 0F E1 /r | PSRAW xmm1, xmm2/m128 | RM | V/V | SSE2 | Shift words in xmm1 right by xmm2/m128 while shifting in sign bits.
0F 71 /4 ib¹ | PSRAW mm, imm8 | MI | V/V | MMX | Shift words in mm right by imm8 while shifting in sign bits.
66 0F 71 /4 ib | PSRAW xmm1, imm8 | MI | V/V | SSE2 | Shift words in xmm1 right by imm8 while shifting in sign bits.
0F E2 /r¹ | PSRAD mm, mm/m64 | RM | V/V | MMX | Shift doublewords in mm right by mm/m64 while shifting in sign bits.
66 0F E2 /r | PSRAD xmm1, xmm2/m128 | RM | V/V | SSE2 | Shift doublewords in xmm1 right by xmm2/m128 while shifting in sign bits.
0F 72 /4 ib¹ | PSRAD mm, imm8 | MI | V/V | MMX | Shift doublewords in mm right by imm8 while shifting in sign bits.
66 0F 72 /4 ib | PSRAD xmm1, imm8 | MI | V/V | SSE2 | Shift doublewords in xmm1 right by imm8 while shifting in sign bits.
VEX.NDS.128.66.0F.WIG E1 /r | VPSRAW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Shift words in xmm2 right by amount specified in xmm3/m128 while shifting in sign bits.
VEX.NDD.128.66.0F.WIG 71 /4 ib | VPSRAW xmm1, xmm2, imm8 | VMI | V/V | AVX | Shift words in xmm2 right by imm8 while shifting in sign bits.
VEX.NDS.128.66.0F.WIG E2 /r | VPSRAD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Shift doublewords in xmm2 right by amount specified in xmm3/m128 while shifting in sign bits.
VEX.NDD.128.66.0F.WIG 72 /4 ib | VPSRAD xmm1, xmm2, imm8 | VMI | V/V | AVX | Shift doublewords in xmm2 right by imm8 while shifting in sign bits.
VEX.NDS.256.66.0F.WIG E1 /r | VPSRAW ymm1, ymm2, xmm3/m128 | RVM | V/V | AVX2 | Shift words in ymm2 right by amount specified in xmm3/m128 while shifting in sign bits.
VEX.NDD.256.66.0F.WIG 71 /4 ib | VPSRAW ymm1, ymm2, imm8 | VMI | V/V | AVX2 | Shift words in ymm2 right by imm8 while shifting in sign bits.
VEX.NDS.256.66.0F.WIG E2 /r | VPSRAD ymm1, ymm2, xmm3/m128 | RVM | V/V | AVX2 | Shift doublewords in ymm2 right by amount specified in xmm3/m128 while shifting in sign bits.
VEX.NDD.256.66.0F.WIG 72 /4 ib | VPSRAD ymm1, ymm2, imm8 | VMI | V/V | AVX2 | Shift doublewords in ymm2 right by imm8 while shifting in sign bits.
EVEX.NDS.128.66.0F.WIG E1 /r | VPSRAW xmm1 {k1}{z}, xmm2, xmm3/m128 | M128 | V/V | AVX512VL AVX512BW | Shift words in xmm2 right by amount specified in xmm3/m128 while shifting in sign bits using writemask k1.
EVEX.NDS.256.66.0F.WIG E1 /r | VPSRAW ymm1 {k1}{z}, ymm2, xmm3/m128 | M128 | V/V | AVX512VL AVX512BW | Shift words in ymm2 right by amount specified in xmm3/m128 while shifting in sign bits using writemask k1.
EVEX.NDS.512.66.0F.WIG E1 /r | VPSRAW zmm1 {k1}{z}, zmm2, xmm3/m128 | M128 | V/V | AVX512BW | Shift words in zmm2 right by amount specified in xmm3/m128 while shifting in sign bits using writemask k1.
EVEX.NDD.128.66.0F.WIG 71 /4 ib | VPSRAW xmm1 {k1}{z}, xmm2/m128, imm8 | FVMI | V/V | AVX512VL AVX512BW | Shift words in xmm2/m128 right by imm8 while shifting in sign bits using writemask k1.
EVEX.NDD.256.66.0F.WIG 71 /4 ib | VPSRAW ymm1 {k1}{z}, ymm2/m256, imm8 | FVMI | V/V | AVX512VL AVX512BW | Shift words in ymm2/m256 right by imm8 while shifting in sign bits using writemask k1.
EVEX.NDD.512.66.0F.WIG 71 /4 ib | VPSRAW zmm1 {k1}{z}, zmm2/m512, imm8 | FVMI | V/V | AVX512BW | Shift words in zmm2/m512 right by imm8 while shifting in sign bits using writemask k1.
EVEX.NDS.128.66.0F.W0 E2 /r | VPSRAD xmm1 {k1}{z}, xmm2, xmm3/m128 | M128 | V/V | AVX512VL AVX512F | Shift doublewords in xmm2 right by amount specified in xmm3/m128 while shifting in sign bits using writemask k1.
EVEX.NDS.256.66.0F.W0 E2 /r | VPSRAD ymm1 {k1}{z}, ymm2, xmm3/m128 | M128 | V/V | AVX512VL AVX512F | Shift doublewords in ymm2 right by amount specified in xmm3/m128 while shifting in sign bits using writemask k1.
EVEX.NDS.512.66.0F.W0 E2 /r | VPSRAD zmm1 {k1}{z}, zmm2, xmm3/m128 | M128 | V/V | AVX512F | Shift doublewords in zmm2 right by amount specified in xmm3/m128 while shifting in sign bits using writemask k1.
EVEX.NDD.128.66.0F.W0 72 /4 ib | VPSRAD xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8 | FVI | V/V | AVX512VL AVX512F | Shift doublewords in xmm2/m128/m32bcst right by imm8 while shifting in sign bits using writemask k1.
EVEX.NDD.256.66.0F.W0 72 /4 ib | VPSRAD ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8 | FVI | V/V | AVX512VL AVX512F | Shift doublewords in ymm2/m256/m32bcst right by imm8 while shifting in sign bits using writemask k1.
EVEX.NDD.512.66.0F.W0 72 /4 ib | VPSRAD zmm1 {k1}{z}, zmm2/m512/m32bcst, imm8 | FVI | V/V | AVX512F | Shift doublewords in zmm2/m512/m32bcst right by imm8 while shifting in sign bits using writemask k1.
EVEX.NDS.128.66.0F.W1 E2 /r | VPSRAQ xmm1 {k1}{z}, xmm2, xmm3/m128 | M128 | V/V | AVX512VL AVX512F | Shift quadwords in xmm2 right by amount specified in xmm3/m128 while shifting in sign bits using writemask k1.
EVEX.NDS.256.66.0F.W1 E2 /r | VPSRAQ ymm1 {k1}{z}, ymm2, xmm3/m128 | M128 | V/V | AVX512VL AVX512F | Shift quadwords in ymm2 right by amount specified in xmm3/m128 while shifting in sign bits using writemask k1.
EVEX.NDS.512.66.0F.W1 E2 /r | VPSRAQ zmm1 {k1}{z}, zmm2, xmm3/m128 | M128 | V/V | AVX512F | Shift quadwords in zmm2 right by amount specified in xmm3/m128 while shifting in sign bits using writemask k1.
EVEX.NDD.128.66.0F.W1 72 /4 ib | VPSRAQ xmm1 {k1}{z}, xmm2/m128/m64bcst, imm8 | FVI | V/V | AVX512VL AVX512F | Shift quadwords in xmm2/m128/m64bcst right by imm8 while shifting in sign bits using writemask k1.
EVEX.NDD.256.66.0F.W1 72 /4 ib | VPSRAQ ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8 | FVI | V/V | AVX512VL AVX512F | Shift quadwords in ymm2/m256/m64bcst right by imm8 while shifting in sign bits using writemask k1.
EVEX.NDD.512.66.0F.W1 72 /4 ib | VPSRAQ zmm1 {k1}{z}, zmm2/m512/m64bcst, imm8 | FVI | V/V | AVX512F | Shift quadwords in zmm2/m512/m64bcst right by imm8 while shifting in sign bits using writemask k1.


NOTES:

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers" in
the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
MI | ModRM:r/m (r, w) | imm8 | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
VMI | VEX.vvvv (w) | ModRM:r/m (r) | imm8 | NA
FVMI | EVEX.vvvv (w) | ModRM:r/m (R) | Imm8 | NA
FVI | EVEX.vvvv (w) | ModRM:r/m (R) | Imm8 | NA
M128 | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Shifts the bits in the individual data elements (words, doublewords or quadwords) in the destination operand (first 
operand) to the right by the number of bits specified in the count operand (second operand). As the bits in the data 
elements are shifted right, the empty high-order bits are filled with the initial value of the sign bit of the data 
element. If the value specified by the count operand is greater than 15 (for words), 31 (for doublewords), or 63 (for 
quadwords), each destination data element is filled with the initial value of the sign bit of the element. (Figure 4-18 
gives an example of shifting words in a 64-bit operand.) 
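
The saturation rule in the paragraph above can be checked with a small scalar model. The sketch below is illustrative only: the name `arithmetic_right_shift_words` and its array interface are assumptions, and the shift is written so that it does not rely on how a particular C compiler treats right shifts of negative values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of an arithmetic right shift of packed words.
 * A count above 15 is clamped to 16, which leaves every bit of the
 * element equal to its original sign bit. */
void arithmetic_right_shift_words(int16_t *dest, const int16_t *src,
                                  size_t lanes, uint64_t count)
{
    assert(dest != NULL && src != NULL);
    if (count > 15) {
        count = 16;
    }
    for (size_t i = 0; i < lanes; i++) {
        int32_t wide = src[i];  /* sign-extend the word to 32 bits */
        /* For negative values, shift the complement and complement back;
         * this yields the sign-filled result using only shifts of
         * non-negative values, which C defines fully. */
        int32_t shifted = (wide < 0) ? ~((~wide) >> count)
                                     : (wide >> count);
        dest[i] = (int16_t)shifted;
    }
}
```

For example, with a count of 20 a negative word becomes -1 (all ones) and a non-negative word becomes 0, matching "filled with the initial value of the sign bit".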



Figure 4-18. PSRAW and PSRAD Instruction Operation Using a 64-bit Operand 


Note that only the first 64-bits of a 128-bit count operand are checked to compute the count. If the second source 
operand is a memory address, 128 bits are loaded. 
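
As a small illustration of the note above, the following sketch derives the effective shift count from a 128-bit count operand. The `count128` structure and the `effective_count` function are hypothetical names introduced here; the point is only that the upper 64 bits never influence the result, while any value in the lower 64 bits above the element width minus one saturates.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical representation of a 128-bit count operand. */
typedef struct {
    uint64_t lo;  /* bits 63:0, the only part examined */
    uint64_t hi;  /* bits 127:64, loaded but ignored   */
} count128;

/* Returns the count actually applied to each element of the given
 * width (16, 32 or 64 bits). A result equal to element_bits means
 * "fill the element with its sign bit". */
unsigned effective_count(count128 c, unsigned element_bits)
{
    assert(element_bits == 16 || element_bits == 32 || element_bits == 64);
    if (c.lo > (uint64_t)(element_bits - 1)) {
        return element_bits;
    }
    return (unsigned)c.lo;
}
```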

The (V)PSRAW instruction shifts each of the words in the destination operand to the right by the number of bits 
specified in the count operand, and the (V)PSRAD instruction shifts each of the doublewords in the destination 
operand. 

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15). 

Legacy SSE instructions 64-bit operand: The destination operand is an MMX technology register; the count
operand can be either an MMX technology register or a 64-bit memory location.

128-bit Legacy SSE version: The destination and first source operands are XMM registers. Bits (VLMAX-1:128) of
the corresponding YMM destination register remain unchanged. The count operand can be either an XMM register
or a 128-bit memory location or an 8-bit immediate. If the count operand is a memory address, 128 bits are loaded
but the upper 64 bits are ignored.

VEX.128 encoded version: The destination and first source operands are XMM registers. Bits (VLMAX-1:128) of the
destination YMM register are zeroed. The count operand can be either an XMM register or a 128-bit memory
location or an 8-bit immediate. If the count operand is a memory address, 128 bits are loaded but the upper 64 bits
are ignored.

VEX.256 encoded version: The destination operand is a YMM register. The source operand is a YMM register or a
memory location. The count operand can come either from an XMM register or a memory location or an 8-bit
immediate. Bits (MAX_VL-1:256) of the corresponding ZMM register are zeroed.

EVEX encoded versions: The destination operand is a ZMM register updated according to the writemask. The count
operand is either an 8-bit immediate (the immediate count version) or an 8-bit value from an XMM register or a
memory location (the variable count version). For the immediate count version, the source operand (the second
operand) can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 32/64-bit
memory location. For the variable count version, the first source operand (the second operand) is a ZMM register,
the second source operand (the third operand, 8-bit variable count) can be an XMM register or a memory location.

Note: In VEX/EVEX encoded versions of shifts with an immediate count, vvvv of VEX/EVEX encodes the destination
register, and VEX.B/EVEX.B + ModRM.r/m encodes the source register.

Note: For shifts with an immediate count (VEX.128.66.0F 71-73 /4, EVEX.128.66.0F 71-73 /4),
VEX.vvvv/EVEX.vvvv encodes the destination register.

Operation

PSRAW (with 64-bit operand)

IF (COUNT > 15)
    THEN COUNT <- 16;
FI;
DEST[15:0] <- SignExtend(DEST[15:0] >> COUNT);
(* Repeat shift operation for 2nd and 3rd words *)
DEST[63:48] <- SignExtend(DEST[63:48] >> COUNT);

PSRAD (with 64-bit operand)

IF (COUNT > 31)
    THEN COUNT <- 32;
FI;
DEST[31:0] <- SignExtend(DEST[31:0] >> COUNT);
DEST[63:32] <- SignExtend(DEST[63:32] >> COUNT);

ARITHMETIC_RIGHT_SHIFT_DWORDS1(SRC, COUNT_SRC)
COUNT <- COUNT_SRC[63:0];
IF (COUNT > 31)
    THEN
        DEST[31:0] <- SignBit
    ELSE
        DEST[31:0] <- SignExtend(SRC[31:0] >> COUNT);
FI;

ARITHMETIC_RIGHT_SHIFT_QWORDS1(SRC, COUNT_SRC)
COUNT <- COUNT_SRC[63:0];
IF (COUNT > 63)
    THEN
        DEST[63:0] <- SignBit
    ELSE
        DEST[63:0] <- SignExtend(SRC[63:0] >> COUNT);
FI;

ARITHMETIC_RIGHT_SHIFT_WORDS_256b(SRC, COUNT_SRC)
COUNT <- COUNT_SRC[63:0];
IF (COUNT > 15)
    THEN COUNT <- 16;
FI;
DEST[15:0] <- SignExtend(SRC[15:0] >> COUNT);
(* Repeat shift operation for 2nd through 15th words *)
DEST[255:240] <- SignExtend(SRC[255:240] >> COUNT);

ARITHMETIC_RIGHT_SHIFT_DWORDS_256b(SRC, COUNT_SRC)
COUNT <- COUNT_SRC[63:0];
IF (COUNT > 31)
    THEN COUNT <- 32;
FI;
DEST[31:0] <- SignExtend(SRC[31:0] >> COUNT);
(* Repeat shift operation for 2nd through 7th doublewords *)
DEST[255:224] <- SignExtend(SRC[255:224] >> COUNT);

ARITHMETIC_RIGHT_SHIFT_QWORDS(SRC, COUNT_SRC, VL)
COUNT <- COUNT_SRC[63:0];
IF (COUNT > 63)
    THEN COUNT <- 64;
FI;
DEST[63:0] <- SignExtend(SRC[63:0] >> COUNT);
(* Repeat shift operation for the remaining quadwords *)
DEST[VL-1:VL-64] <- SignExtend(SRC[VL-1:VL-64] >> COUNT);

ARITHMETIC_RIGHT_SHIFT_WORDS(SRC, COUNT_SRC)
COUNT <- COUNT_SRC[63:0];
IF (COUNT > 15)
    THEN COUNT <- 16;
FI;
DEST[15:0] <- SignExtend(SRC[15:0] >> COUNT);
(* Repeat shift operation for 2nd through 7th words *)
DEST[127:112] <- SignExtend(SRC[127:112] >> COUNT);

ARITHMETIC_RIGHT_SHIFT_DWORDS(SRC, COUNT_SRC)
COUNT <- COUNT_SRC[63:0];
IF (COUNT > 31)
    THEN COUNT <- 32;
FI;
DEST[31:0] <- SignExtend(SRC[31:0] >> COUNT);
(* Repeat shift operation for 2nd through 3rd doublewords *)
DEST[127:96] <- SignExtend(SRC[127:96] >> COUNT);

; VL: 128b, 256b or 512b

VPSRAW (EVEX versions, xmm/m128)

(KL, VL) = (8, 128), (16, 256), (32, 512)

IF VL = 128
    TMP_DEST[127:0] <- ARITHMETIC_RIGHT_SHIFT_WORDS_128b(SRC1[127:0], SRC2)
FI;
IF VL = 256
    TMP_DEST[255:0] <- ARITHMETIC_RIGHT_SHIFT_WORDS_256b(SRC1[255:0], SRC2)
FI;
IF VL = 512
    TMP_DEST[255:0] <- ARITHMETIC_RIGHT_SHIFT_WORDS_256b(SRC1[255:0], SRC2)
    TMP_DEST[511:256] <- ARITHMETIC_RIGHT_SHIFT_WORDS_256b(SRC1[511:256], SRC2)
FI;

FOR j <- 0 TO KL-1
    i <- j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] <- TMP_DEST[i+15:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+15:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0

VPSRAW (EVEX versions, imm8)

(KL, VL) = (8, 128), (16, 256), (32, 512)

IF VL = 128
    TMP_DEST[127:0] <- ARITHMETIC_RIGHT_SHIFT_WORDS_128b(SRC1[127:0], imm8)
FI;
IF VL = 256
    TMP_DEST[255:0] <- ARITHMETIC_RIGHT_SHIFT_WORDS_256b(SRC1[255:0], imm8)
FI;
IF VL = 512
    TMP_DEST[255:0] <- ARITHMETIC_RIGHT_SHIFT_WORDS_256b(SRC1[255:0], imm8)
    TMP_DEST[511:256] <- ARITHMETIC_RIGHT_SHIFT_WORDS_256b(SRC1[511:256], imm8)
FI;

FOR j <- 0 TO KL-1
    i <- j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] <- TMP_DEST[i+15:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+15:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0

VPSRAW (ymm, ymm, xmm/m128) - VEX

DEST[255:0] <- ARITHMETIC_RIGHT_SHIFT_WORDS_256b(SRC1, SRC2)
DEST[MAX_VL-1:256] <- 0

VPSRAW (ymm, imm8) - VEX

DEST[255:0] <- ARITHMETIC_RIGHT_SHIFT_WORDS_256b(SRC1, imm8)
DEST[MAX_VL-1:256] <- 0

VPSRAW (xmm, xmm, xmm/m128) - VEX

DEST[127:0] <- ARITHMETIC_RIGHT_SHIFT_WORDS(SRC1, SRC2)
DEST[MAX_VL-1:128] <- 0

VPSRAW (xmm, imm8) - VEX

DEST[127:0] <- ARITHMETIC_RIGHT_SHIFT_WORDS(SRC1, imm8)
DEST[MAX_VL-1:128] <- 0

PSRAW (xmm, xmm, xmm/m128)

DEST[127:0] <- ARITHMETIC_RIGHT_SHIFT_WORDS(DEST, SRC)
DEST[MAX_VL-1:128] (Unmodified)

PSRAW (xmm, imm8)

DEST[127:0] <- ARITHMETIC_RIGHT_SHIFT_WORDS(DEST, imm8)
DEST[MAX_VL-1:128] (Unmodified)

VPSRAD (EVEX versions, imm8)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j <- 0 TO KL-1
    i <- j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC1 *is memory*)
            THEN DEST[i+31:i] <- ARITHMETIC_RIGHT_SHIFT_DWORDS1(SRC1[31:0], imm8)
            ELSE DEST[i+31:i] <- ARITHMETIC_RIGHT_SHIFT_DWORDS1(SRC1[i+31:i], imm8)
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+31:i] <- 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0
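
The EVEX.b test in the immediate-count form selects between reading a separate doubleword per element and reusing the single doubleword at SRC1[31:0] for every element (embedded broadcast from memory). A minimal scalar sketch of that selection follows; masking is deliberately omitted, and the name `vpsrad_imm_model` and the `broadcast` flag are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the element selection in VPSRAD (EVEX, imm8).
 * broadcast != 0 models EVEX.b = 1 with a memory source: element 0 of
 * src1 is used for every destination element. Writemasking is omitted. */
void vpsrad_imm_model(int32_t *dest, const int32_t *src1, size_t kl,
                      int broadcast, unsigned imm8)
{
    assert(dest != NULL && src1 != NULL);
    /* Counts above 31 behave like 32: the result is all sign bits. */
    unsigned count = (imm8 > 31) ? 32u : imm8;
    for (size_t j = 0; j < kl; j++) {
        /* Widen to 64 bits so that a shift by 32 is well defined in C. */
        int64_t v = broadcast ? src1[0] : src1[j];
        int64_t r = (v < 0) ? ~((~v) >> count) : (v >> count);
        dest[j] = (int32_t)r;
    }
}
```

With `broadcast` set, every destination element receives the shifted value of the first source element, which mirrors the THEN branch of the pseudocode.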


VPSRAD (EVEX versions, xmm/m128)

(KL, VL) = (4, 128), (8, 256), (16, 512)

IF VL = 128
    TMP_DEST[127:0] <- ARITHMETIC_RIGHT_SHIFT_DWORDS_128b(SRC1[127:0], SRC2)
FI;
IF VL = 256
    TMP_DEST[255:0] <- ARITHMETIC_RIGHT_SHIFT_DWORDS_256b(SRC1[255:0], SRC2)
FI;
IF VL = 512
    TMP_DEST[255:0] <- ARITHMETIC_RIGHT_SHIFT_DWORDS_256b(SRC1[255:0], SRC2)
    TMP_DEST[511:256] <- ARITHMETIC_RIGHT_SHIFT_DWORDS_256b(SRC1[511:256], SRC2)
FI;

FOR j <- 0 TO KL-1
    i <- j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] <- TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+31:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0

VPSRAD (ymm, ymm, xmm/m128) - VEX

DEST[255:0] <- ARITHMETIC_RIGHT_SHIFT_DWORDS_256b(SRC1, SRC2)
DEST[MAX_VL-1:256] <- 0

VPSRAD (ymm, imm8) - VEX

DEST[255:0] <- ARITHMETIC_RIGHT_SHIFT_DWORDS_256b(SRC1, imm8)
DEST[MAX_VL-1:256] <- 0

VPSRAD (xmm, xmm, xmm/m128) - VEX

DEST[127:0] <- ARITHMETIC_RIGHT_SHIFT_DWORDS(SRC1, SRC2)
DEST[MAX_VL-1:128] <- 0

VPSRAD (xmm, imm8) - VEX

DEST[127:0] <- ARITHMETIC_RIGHT_SHIFT_DWORDS(SRC1, imm8)
DEST[MAX_VL-1:128] <- 0

PSRAD (xmm, xmm, xmm/m128)

DEST[127:0] <- ARITHMETIC_RIGHT_SHIFT_DWORDS(DEST, SRC)
DEST[MAX_VL-1:128] (Unmodified)

PSRAD (xmm, imm8)

DEST[127:0] <- ARITHMETIC_RIGHT_SHIFT_DWORDS(DEST, imm8)
DEST[MAX_VL-1:128] (Unmodified)

VPSRAQ (EVEX versions, imm8)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j <- 0 TO KL-1
    i <- j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC1 *is memory*)
            THEN DEST[i+63:i] <- ARITHMETIC_RIGHT_SHIFT_QWORDS1(SRC1[63:0], imm8)
            ELSE DEST[i+63:i] <- ARITHMETIC_RIGHT_SHIFT_QWORDS1(SRC1[i+63:i], imm8)
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE *zeroing-masking* ; zeroing-masking
                DEST[i+63:i] <- 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0

VPSRAQ (EVEX versions, xmm/m128)

(KL, VL) = (2, 128), (4, 256), (8, 512)

TMP_DEST[VL-1:0] <- ARITHMETIC_RIGHT_SHIFT_QWORDS(SRC1[VL-1:0], SRC2, VL)

FOR j <- 0 TO KL-1
    i <- j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] <- TMP_DEST[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+63:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0


Intel C/C++ Compiler Intrinsic Equivalents

VPSRAD __m512i _mm512_srai_epi32(__m512i a, unsigned int imm);
VPSRAD __m512i _mm512_mask_srai_epi32(__m512i s, __mmask16 k, __m512i a, unsigned int imm);
VPSRAD __m512i _mm512_maskz_srai_epi32(__mmask16 k, __m512i a, unsigned int imm);
VPSRAD __m256i _mm256_mask_srai_epi32(__m256i s, __mmask8 k, __m256i a, unsigned int imm);
VPSRAD __m256i _mm256_maskz_srai_epi32(__mmask8 k, __m256i a, unsigned int imm);
VPSRAD __m128i _mm_mask_srai_epi32(__m128i s, __mmask8 k, __m128i a, unsigned int imm);
VPSRAD __m128i _mm_maskz_srai_epi32(__mmask8 k, __m128i a, unsigned int imm);
VPSRAD __m512i _mm512_sra_epi32(__m512i a, __m128i cnt);
VPSRAD __m512i _mm512_mask_sra_epi32(__m512i s, __mmask16 k, __m512i a, __m128i cnt);
VPSRAD __m512i _mm512_maskz_sra_epi32(__mmask16 k, __m512i a, __m128i cnt);
VPSRAD __m256i _mm256_mask_sra_epi32(__m256i s, __mmask8 k, __m256i a, __m128i cnt);
VPSRAD __m256i _mm256_maskz_sra_epi32(__mmask8 k, __m256i a, __m128i cnt);
VPSRAD __m128i _mm_mask_sra_epi32(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSRAD __m128i _mm_maskz_sra_epi32(__mmask8 k, __m128i a, __m128i cnt);
VPSRAQ __m512i _mm512_srai_epi64(__m512i a, unsigned int imm);
VPSRAQ __m512i _mm512_mask_srai_epi64(__m512i s, __mmask8 k, __m512i a, unsigned int imm);
VPSRAQ __m512i _mm512_maskz_srai_epi64(__mmask8 k, __m512i a, unsigned int imm);
VPSRAQ __m256i _mm256_mask_srai_epi64(__m256i s, __mmask8 k, __m256i a, unsigned int imm);
VPSRAQ __m256i _mm256_maskz_srai_epi64(__mmask8 k, __m256i a, unsigned int imm);
VPSRAQ __m128i _mm_mask_srai_epi64(__m128i s, __mmask8 k, __m128i a, unsigned int imm);
VPSRAQ __m128i _mm_maskz_srai_epi64(__mmask8 k, __m128i a, unsigned int imm);
VPSRAQ __m512i _mm512_sra_epi64(__m512i a, __m128i cnt);
VPSRAQ __m512i _mm512_mask_sra_epi64(__m512i s, __mmask8 k, __m512i a, __m128i cnt);
VPSRAQ __m512i _mm512_maskz_sra_epi64(__mmask8 k, __m512i a, __m128i cnt);
VPSRAQ __m256i _mm256_mask_sra_epi64(__m256i s, __mmask8 k, __m256i a, __m128i cnt);
VPSRAQ __m256i _mm256_maskz_sra_epi64(__mmask8 k, __m256i a, __m128i cnt);
VPSRAQ __m128i _mm_mask_sra_epi64(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSRAQ __m128i _mm_maskz_sra_epi64(__mmask8 k, __m128i a, __m128i cnt);
VPSRAW __m512i _mm512_srai_epi16(__m512i a, unsigned int imm);
VPSRAW __m512i _mm512_mask_srai_epi16(__m512i s, __mmask32 k, __m512i a, unsigned int imm);
VPSRAW __m512i _mm512_maskz_srai_epi16(__mmask32 k, __m512i a, unsigned int imm);
VPSRAW __m256i _mm256_mask_srai_epi16(__m256i s, __mmask16 k, __m256i a, unsigned int imm);
VPSRAW __m256i _mm256_maskz_srai_epi16(__mmask16 k, __m256i a, unsigned int imm);
VPSRAW __m128i _mm_mask_srai_epi16(__m128i s, __mmask8 k, __m128i a, unsigned int imm);
VPSRAW __m128i _mm_maskz_srai_epi16(__mmask8 k, __m128i a, unsigned int imm);
VPSRAW __m512i _mm512_sra_epi16(__m512i a, __m128i cnt);
VPSRAW __m512i _mm512_mask_sra_epi16(__m512i s, __mmask32 k, __m512i a, __m128i cnt);
VPSRAW __m512i _mm512_maskz_sra_epi16(__mmask32 k, __m512i a, __m128i cnt);
VPSRAW __m256i _mm256_mask_sra_epi16(__m256i s, __mmask16 k, __m256i a, __m128i cnt);
VPSRAW __m256i _mm256_maskz_sra_epi16(__mmask16 k, __m256i a, __m128i cnt);
VPSRAW __m128i _mm_mask_sra_epi16(__m128i s, __mmask8 k, __m128i a, __m128i cnt);
VPSRAW __m128i _mm_maskz_sra_epi16(__mmask8 k, __m128i a, __m128i cnt);
PSRAW: __m64 _mm_srai_pi16(__m64 m, int count)
PSRAW: __m64 _mm_sra_pi16(__m64 m, __m64 count)
(V)PSRAW: __m128i _mm_srai_epi16(__m128i m, int count)
(V)PSRAW: __m128i _mm_sra_epi16(__m128i m, __m128i count)
VPSRAW: __m256i _mm256_srai_epi16(__m256i m, int count)
VPSRAW: __m256i _mm256_sra_epi16(__m256i m, __m128i count)
PSRAD: __m64 _mm_srai_pi32(__m64 m, int count)
PSRAD: __m64 _mm_sra_pi32(__m64 m, __m64 count)
(V)PSRAD: __m128i _mm_srai_epi32(__m128i m, int count)
(V)PSRAD: __m128i _mm_sra_epi32(__m128i m, __m128i count)
VPSRAD: __m256i _mm256_srai_epi32(__m256i m, int count)
VPSRAD: __m256i _mm256_sra_epi32(__m256i m, __m128i count)

Flags Affected 

None. 

Numeric Exceptions 

None. 

Other Exceptions 

VEX-encoded instructions: 

Syntax with RM/RVM operand encoding, see Exceptions Type 4. 

Syntax with MI/VMI operand encoding, see Exceptions Type 7. 

EVEX-encoded VPSRAW, see Exceptions Type E4NF.nb. 

EVEX-encoded VPSRAD/Q: 

Syntax with M128 operand encoding, see Exceptions Type E4NF.nb. 

Syntax with FVI operand encoding, see Exceptions Type E4. 
PSRLDQ-Shift Double Quadword Right Logical


Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 73 /3 ib | PSRLDQ xmm1, imm8 | MI | V/V | SSE2 | Shift xmm1 right by imm8 while shifting in 0s.
VEX.NDD.128.66.0F.WIG 73 /3 ib | VPSRLDQ xmm1, xmm2, imm8 | VMI | V/V | AVX | Shift xmm2 right by imm8 bytes while shifting in 0s.
VEX.NDD.256.66.0F.WIG 73 /3 ib | VPSRLDQ ymm1, ymm2, imm8 | VMI | V/V | AVX2 | Shift ymm1 right by imm8 bytes while shifting in 0s.
EVEX.NDD.128.66.0F.WIG 73 /3 ib | VPSRLDQ xmm1, xmm2/m128, imm8 | FVM | V/V | AVX512VL AVX512BW | Shift xmm2/m128 right by imm8 bytes while shifting in 0s and store result in xmm1.
EVEX.NDD.256.66.0F.WIG 73 /3 ib | VPSRLDQ ymm1, ymm2/m256, imm8 | FVM | V/V | AVX512VL AVX512BW | Shift ymm2/m256 right by imm8 bytes while shifting in 0s and store result in ymm1.
EVEX.NDD.512.66.0F.WIG 73 /3 ib | VPSRLDQ zmm1, zmm2/m512, imm8 | FVM | V/V | AVX512BW | Shift zmm2/m512 right by imm8 bytes while shifting in 0s and store result in zmm1.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
MI | ModRM:r/m (r, w) | imm8 | NA | NA
VMI | VEX.vvvv (w) | ModRM:r/m (r) | imm8 | NA
FVM | EVEX.vvvv (w) | ModRM:r/m (R) | Imm8 | NA


Description 

Shifts the destination operand (first operand) to the right by the number of bytes specified in the count operand
(second operand). The empty high-order bytes are cleared (set to all 0s). If the value specified by the count
operand is greater than 15, the destination operand is set to all 0s. The count operand is an 8-bit immediate.

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).

128-bit Legacy SSE version: The source and destination operands are the same. Bits (VLMAX-1:128) of the
corresponding YMM destination register remain unchanged.

VEX.128 encoded version: The source and destination operands are XMM registers. Bits (VLMAX-1:128) of the
destination YMM register are zeroed.

VEX.256 encoded version: The source operand is a YMM register. The destination operand is a YMM register. Bits
(MAX_VL-1:256) of the corresponding ZMM register are zeroed. The count operand applies to both the low and
high 128-bit lanes.

EVEX encoded versions: The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The destination operand is a ZMM/YMM/XMM register. The count operand applies to each 128-bit lane.

Note: VEX.vvvv/EVEX.vvvv encodes the destination register.
Operation

VPSRLDQ (EVEX.512 encoded version)

TEMP <- COUNT
IF (TEMP > 15) THEN TEMP <- 16; FI
DEST[127:0] <- SRC[127:0] >> (TEMP * 8)
DEST[255:128] <- SRC[255:128] >> (TEMP * 8)
DEST[383:256] <- SRC[383:256] >> (TEMP * 8)
DEST[511:384] <- SRC[511:384] >> (TEMP * 8)
DEST[MAX_VL-1:512] <- 0;


VPSRLDQ (VEX.256 and EVEX.256 encoded version)

TEMP <- COUNT
IF (TEMP > 15) THEN TEMP <- 16; FI
DEST[127:0] <- SRC[127:0] >> (TEMP * 8)
DEST[255:128] <- SRC[255:128] >> (TEMP * 8)
DEST[MAX_VL-1:256] <- 0;


VPSRLDQ (VEX.128 and EVEX.128 encoded version)

TEMP <- COUNT
IF (TEMP > 15) THEN TEMP <- 16; FI
DEST <- SRC >> (TEMP * 8)
DEST[MAX_VL-1:128] <- 0;


PSRLDQ (128-bit Legacy SSE version)

TEMP <- COUNT
IF (TEMP > 15) THEN TEMP <- 16; FI
DEST <- DEST >> (TEMP * 8)
DEST[MAX_VL-1:128] (Unmodified)
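
The per-lane behavior in the pseudocode above can be modeled in portable C. The sketch below is illustrative only: psrldq_model is a hypothetical helper name, not an Intel intrinsic, and it assumes the vector is stored as consecutive 16-byte lanes with the least significant byte first.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not an Intel API): byte-wise logical right shift
 * applied independently to each 16-byte lane, mirroring the pseudocode.
 * src and dst hold `lanes` consecutive 128-bit lanes, least significant
 * byte first. dst must not overlap src. */
static void psrldq_model(uint8_t *dst, const uint8_t *src,
                         unsigned lanes, uint8_t count)
{
    /* IF (TEMP > 15) THEN TEMP <- 16 */
    unsigned temp = count > 15 ? 16u : count;

    /* 128-, 256-, or 512-bit vectors */
    assert(lanes == 1 || lanes == 2 || lanes == 4);

    for (unsigned lane = 0; lane < lanes; lane++) {
        const uint8_t *s = src + 16 * lane;
        uint8_t *d = dst + 16 * lane;

        memset(d, 0, 16);                   /* vacated high-order bytes are 0 */
        if (temp < 16)
            memcpy(d, s + temp, 16 - temp); /* byte i receives byte i + temp */
    }
}
```

Because each lane is shifted on its own, bytes never cross from the high 128-bit lane into the low one, which is why the same count yields zeros at the top of every lane.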

Intel C/C++ Compiler Intrinsic Equivalents 

(V)PSRLDQ __m128i _mm_srli_si128 (__m128i a, int imm)

VPSRLDQ __m256i _mm256_bsrli_epi128 (__m256i, const int)

VPSRLDQ __m512i _mm512_bsrli_epi128 (__m512i, int)

Flags Affected 

None. 

Numeric Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 7. 
EVEX-encoded instruction, see Exceptions Type E4NF.nb. 
PSRLW/PSRLD/PSRLQ-Shift Packed Data Right Logical


Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F D1 /r¹ | PSRLW mm, mm/m64 | RM | V/V | MMX | Shift words in mm right by amount specified in mm/m64 while shifting in 0s.
66 0F D1 /r | PSRLW xmm1, xmm2/m128 | RM | V/V | SSE2 | Shift words in xmm1 right by amount specified in xmm2/m128 while shifting in 0s.
0F 71 /2 ib¹ | PSRLW mm, imm8 | MI | V/V | MMX | Shift words in mm right by imm8 while shifting in 0s.
66 0F 71 /2 ib | PSRLW xmm1, imm8 | MI | V/V | SSE2 | Shift words in xmm1 right by imm8 while shifting in 0s.
0F D2 /r¹ | PSRLD mm, mm/m64 | RM | V/V | MMX | Shift doublewords in mm right by amount specified in mm/m64 while shifting in 0s.
66 0F D2 /r | PSRLD xmm1, xmm2/m128 | RM | V/V | SSE2 | Shift doublewords in xmm1 right by amount specified in xmm2/m128 while shifting in 0s.
0F 72 /2 ib¹ | PSRLD mm, imm8 | MI | V/V | MMX | Shift doublewords in mm right by imm8 while shifting in 0s.
66 0F 72 /2 ib | PSRLD xmm1, imm8 | MI | V/V | SSE2 | Shift doublewords in xmm1 right by imm8 while shifting in 0s.
0F D3 /r¹ | PSRLQ mm, mm/m64 | RM | V/V | MMX | Shift mm right by amount specified in mm/m64 while shifting in 0s.
66 0F D3 /r | PSRLQ xmm1, xmm2/m128 | RM | V/V | SSE2 | Shift quadwords in xmm1 right by amount specified in xmm2/m128 while shifting in 0s.
0F 73 /2 ib¹ | PSRLQ mm, imm8 | MI | V/V | MMX | Shift mm right by imm8 while shifting in 0s.
66 0F 73 /2 ib | PSRLQ xmm1, imm8 | MI | V/V | SSE2 | Shift quadwords in xmm1 right by imm8 while shifting in 0s.
VEX.NDS.128.66.0F.WIG D1 /r | VPSRLW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Shift words in xmm2 right by amount specified in xmm3/m128 while shifting in 0s.
VEX.NDD.128.66.0F.WIG 71 /2 ib | VPSRLW xmm1, xmm2, imm8 | VMI | V/V | AVX | Shift words in xmm2 right by imm8 while shifting in 0s.
VEX.NDS.128.66.0F.WIG D2 /r | VPSRLD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Shift doublewords in xmm2 right by amount specified in xmm3/m128 while shifting in 0s.
VEX.NDD.128.66.0F.WIG 72 /2 ib | VPSRLD xmm1, xmm2, imm8 | VMI | V/V | AVX | Shift doublewords in xmm2 right by imm8 while shifting in 0s.
VEX.NDS.128.66.0F.WIG D3 /r | VPSRLQ xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Shift quadwords in xmm2 right by amount specified in xmm3/m128 while shifting in 0s.
VEX.NDD.128.66.0F.WIG 73 /2 ib | VPSRLQ xmm1, xmm2, imm8 | VMI | V/V | AVX | Shift quadwords in xmm2 right by imm8 while shifting in 0s.
VEX.NDS.256.66.0F.WIG D1 /r | VPSRLW ymm1, ymm2, xmm3/m128 | RVM | V/V | AVX2 | Shift words in ymm2 right by amount specified in xmm3/m128 while shifting in 0s.
VEX.NDD.256.66.0F.WIG 71 /2 ib | VPSRLW ymm1, ymm2, imm8 | VMI | V/V | AVX2 | Shift words in ymm2 right by imm8 while shifting in 0s.
VEX.NDS.256.66.0F.WIG D2 /r | VPSRLD ymm1, ymm2, xmm3/m128 | RVM | V/V | AVX2 | Shift doublewords in ymm2 right by amount specified in xmm3/m128 while shifting in 0s.
VEX.NDD.256.66.0F.WIG 72 /2 ib | VPSRLD ymm1, ymm2, imm8 | VMI | V/V | AVX2 | Shift doublewords in ymm2 right by imm8 while shifting in 0s.
VEX.NDS.256.66.0F.WIG D3 /r | VPSRLQ ymm1, ymm2, xmm3/m128 | RVM | V/V | AVX2 | Shift quadwords in ymm2 right by amount specified in xmm3/m128 while shifting in 0s.
VEX.NDD.256.66.0F.WIG 73 /2 ib | VPSRLQ ymm1, ymm2, imm8 | VMI | V/V | AVX2 | Shift quadwords in ymm2 right by imm8 while shifting in 0s.
EVEX.NDS.128.66.0F.WIG D1 /r | VPSRLW xmm1 {k1}{z}, xmm2, xmm3/m128 | M128 | V/V | AVX512VL AVX512BW | Shift words in xmm2 right by amount specified in xmm3/m128 while shifting in 0s using writemask k1.
EVEX.NDS.256.66.0F.WIG D1 /r | VPSRLW ymm1 {k1}{z}, ymm2, xmm3/m128 | M128 | V/V | AVX512VL AVX512BW | Shift words in ymm2 right by amount specified in xmm3/m128 while shifting in 0s using writemask k1.
EVEX.NDS.512.66.0F.WIG D1 /r | VPSRLW zmm1 {k1}{z}, zmm2, xmm3/m128 | M128 | V/V | AVX512BW | Shift words in zmm2 right by amount specified in xmm3/m128 while shifting in 0s using writemask k1.
EVEX.NDD.128.66.0F.WIG 71 /2 ib | VPSRLW xmm1 {k1}{z}, xmm2/m128, imm8 | FVM | V/V | AVX512VL AVX512BW | Shift words in xmm2/m128 right by imm8 while shifting in 0s using writemask k1.
EVEX.NDD.256.66.0F.WIG 71 /2 ib | VPSRLW ymm1 {k1}{z}, ymm2/m256, imm8 | FVM | V/V | AVX512VL AVX512BW | Shift words in ymm2/m256 right by imm8 while shifting in 0s using writemask k1.
EVEX.NDD.512.66.0F.WIG 71 /2 ib | VPSRLW zmm1 {k1}{z}, zmm2/m512, imm8 | FVM | V/V | AVX512BW | Shift words in zmm2/m512 right by imm8 while shifting in 0s using writemask k1.
EVEX.NDS.128.66.0F.W0 D2 /r | VPSRLD xmm1 {k1}{z}, xmm2, xmm3/m128 | M128 | V/V | AVX512VL AVX512F | Shift doublewords in xmm2 right by amount specified in xmm3/m128 while shifting in 0s using writemask k1.
EVEX.NDS.256.66.0F.W0 D2 /r | VPSRLD ymm1 {k1}{z}, ymm2, xmm3/m128 | M128 | V/V | AVX512VL AVX512F | Shift doublewords in ymm2 right by amount specified in xmm3/m128 while shifting in 0s using writemask k1.
EVEX.NDS.512.66.0F.W0 D2 /r | VPSRLD zmm1 {k1}{z}, zmm2, xmm3/m128 | M128 | V/V | AVX512F | Shift doublewords in zmm2 right by amount specified in xmm3/m128 while shifting in 0s using writemask k1.
EVEX.NDD.128.66.0F.W0 72 /2 ib | VPSRLD xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8 | FV | V/V | AVX512VL AVX512F | Shift doublewords in xmm2/m128/m32bcst right by imm8 while shifting in 0s using writemask k1.
EVEX.NDD.256.66.0F.W0 72 /2 ib | VPSRLD ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8 | FV | V/V | AVX512VL AVX512F | Shift doublewords in ymm2/m256/m32bcst right by imm8 while shifting in 0s using writemask k1.
EVEX.NDD.512.66.0F.W0 72 /2 ib | VPSRLD zmm1 {k1}{z}, zmm2/m512/m32bcst, imm8 | FVI | V/V | AVX512F | Shift doublewords in zmm2/m512/m32bcst right by imm8 while shifting in 0s using writemask k1.
EVEX.NDS.128.66.0F.W1 D3 /r | VPSRLQ xmm1 {k1}{z}, xmm2, xmm3/m128 | M128 | V/V | AVX512VL AVX512F | Shift quadwords in xmm2 right by amount specified in xmm3/m128 while shifting in 0s using writemask k1.
EVEX.NDS.256.66.0F.W1 D3 /r | VPSRLQ ymm1 {k1}{z}, ymm2, xmm3/m128 | M128 | V/V | AVX512VL AVX512F | Shift quadwords in ymm2 right by amount specified in xmm3/m128 while shifting in 0s using writemask k1.
EVEX.NDS.512.66.0F.W1 D3 /r | VPSRLQ zmm1 {k1}{z}, zmm2, xmm3/m128 | M128 | V/V | AVX512F | Shift quadwords in zmm2 right by amount specified in xmm3/m128 while shifting in 0s using writemask k1.
EVEX.NDD.128.66.0F.W1 73 /2 ib | VPSRLQ xmm1 {k1}{z}, xmm2/m128/m64bcst, imm8 | FV | V/V | AVX512VL AVX512F | Shift quadwords in xmm2/m128/m64bcst right by imm8 while shifting in 0s using writemask k1.
EVEX.NDD.256.66.0F.W1 73 /2 ib | VPSRLQ ymm1 {k1}{z}, ymm2/m256/m64bcst, imm8 | FV | V/V | AVX512VL AVX512F | Shift quadwords in ymm2/m256/m64bcst right by imm8 while shifting in 0s using writemask k1.
EVEX.NDD.512.66.0F.W1 73 /2 ib | VPSRLQ zmm1 {k1}{z}, zmm2/m512/m64bcst, imm8 | FVI | V/V | AVX512F | Shift quadwords in zmm2/m512/m64bcst right by imm8 while shifting in 0s using writemask k1.


NOTES:

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers"
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
MI | ModRM:r/m (r, w) | imm8 | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
VMI | VEX.vvvv (w) | ModRM:r/m (r) | imm8 | NA
FVM | EVEX.vvvv (w) | ModRM:r/m (R) | Imm8 | NA
FVI | EVEX.vvvv (w) | ModRM:r/m (R) | Imm8 | NA
M128 | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Shifts the bits in the individual data elements (words, doublewords, or quadwords) in the destination operand (first
operand) to the right by the number of bits specified in the count operand (second operand). As the bits in the data
elements are shifted right, the empty high-order bits are cleared (set to 0). If the value specified by the count
operand is greater than 15 (for words), 31 (for doublewords), or 63 (for a quadword), then the destination operand
is set to all 0s. Figure 4-19 gives an example of shifting words in a 64-bit operand.

Note that only the low 64 bits of a 128-bit count operand are checked to compute the count.
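
The count rule described above can be sketched in portable C for the 64-bit (MMX-sized) word case. This is a hedged model, not Intel code: psrlw64_model is a hypothetical name, the four words are assumed to be packed into a uint64_t, and the count argument stands for the low 64 bits of the count operand.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of PSRLW on a 64-bit operand: four 16-bit words,
 * each shifted right by the same count with zeros shifted in. */
static uint64_t psrlw64_model(uint64_t dest, uint64_t count)
{
    uint64_t result = 0;

    /* A count above 15 clears the whole destination. */
    if (count > 15)
        return 0;

    for (unsigned w = 0; w < 4; w++) {
        uint16_t word = (uint16_t)(dest >> (16 * w));
        uint16_t shifted = (uint16_t)(word >> count);

        result |= (uint64_t)shifted << (16 * w);
    }
    return result;
}
```

For example, under these assumptions psrlw64_model(0x8000FFFF12340001, 4) yields 0x08000FFF01230000: each word is shifted separately, so no bits move from one word into its neighbor.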



Figure 4-19. PSRLW, PSRLD, and PSRLQ Instruction Operation Using 64-bit Operand 


The (V)PSRLW instruction shifts each of the words in the destination operand to the right by the number of bits 
specified in the count operand; the (V)PSRLD instruction shifts each of the doublewords in the destination operand; 
and the PSRLQ instruction shifts the quadword (or quadwords) in the destination operand. 

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15). 

Legacy SSE instruction 64-bit operand: The destination operand is an MMX technology register; the count operand
can be either an MMX technology register or a 64-bit memory location.

128-bit Legacy SSE version: The destination operand is an XMM register; the count operand can be either an XMM
register or a 128-bit memory location, or an 8-bit immediate. If the count operand is a memory address, 128 bits
are loaded but the upper 64 bits are ignored. Bits (VLMAX-1:128) of the corresponding YMM destination register
remain unchanged.

VEX.128 encoded version: The destination operand is an XMM register; the count operand can be either an XMM
register or a 128-bit memory location, or an 8-bit immediate. If the count operand is a memory address, 128 bits
are loaded but the upper 64 bits are ignored. Bits (VLMAX-1:128) of the destination YMM register are zeroed.

VEX.256 encoded version: The destination operand is a YMM register. The source operand is a YMM register or a
memory location. The count operand can come either from an XMM register or a memory location or an 8-bit
immediate. Bits (MAX_VL-1:256) of the corresponding ZMM register are zeroed.

EVEX encoded versions: The destination operand is a ZMM register updated according to the writemask. The count
operand is either an 8-bit immediate (the immediate count version) or an 8-bit value from an XMM register or a
memory location (the variable count version). For the immediate count version, the source operand (the second
operand) can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 32/64-bit
memory location. For the variable count version, the first source operand (the second operand) is a ZMM register,
the second source operand (the third operand, 8-bit variable count) can be an XMM register or a memory location.

Note: In VEX/EVEX encoded versions of shifts with an immediate count, vvvv of VEX/EVEX encode the destination
register, and VEX.B/EVEX.B + ModRM.r/m encodes the source register.

Note: For shifts with an immediate count (VEX.128.66.0F 71-73 /2, or EVEX.128.66.0F 71-73 /2),
VEX.vvvv/EVEX.vvvv encodes the destination register.

Operation 

PSRLW (with 64-bit operand)

IF (COUNT > 15)
THEN
    DEST[63:0] <- 0000000000000000H
ELSE
    DEST[15:0] <- ZeroExtend(DEST[15:0] >> COUNT);
    (* Repeat shift operation for 2nd and 3rd words *)
    DEST[63:48] <- ZeroExtend(DEST[63:48] >> COUNT);
FI;

PSRLD (with 64-bit operand)

IF (COUNT > 31)
THEN
    DEST[63:0] <- 0000000000000000H
ELSE
    DEST[31:0] <- ZeroExtend(DEST[31:0] >> COUNT);
    DEST[63:32] <- ZeroExtend(DEST[63:32] >> COUNT);
FI;

PSRLQ (with 64-bit operand)

IF (COUNT > 63)
THEN
    DEST[63:0] <- 0000000000000000H
ELSE
    DEST <- ZeroExtend(DEST >> COUNT);
FI;

LOGICAL_RIGHT_SHIFT_DWORDS1(SRC, COUNT_SRC)
COUNT <- COUNT_SRC[63:0];
IF (COUNT > 31)
THEN
    DEST[31:0] <- 0
ELSE
    DEST[31:0] <- ZeroExtend(SRC[31:0] >> COUNT);
FI;

LOGICAL_RIGHT_SHIFT_QWORDS1(SRC, COUNT_SRC)
COUNT <- COUNT_SRC[63:0];
IF (COUNT > 63)
THEN
    DEST[63:0] <- 0
ELSE
    DEST[63:0] <- ZeroExtend(SRC[63:0] >> COUNT);
FI;

LOGICAL_RIGHT_SHIFT_WORDS_256b(SRC, COUNT_SRC)
COUNT <- COUNT_SRC[63:0];
IF (COUNT > 15)
THEN
    DEST[255:0] <- 0
ELSE
    DEST[15:0] <- ZeroExtend(SRC[15:0] >> COUNT);
    (* Repeat shift operation for 2nd through 15th words *)
    DEST[255:240] <- ZeroExtend(SRC[255:240] >> COUNT);
FI;

LOGICAL_RIGHT_SHIFT_WORDS(SRC, COUNT_SRC)
COUNT <- COUNT_SRC[63:0];
IF (COUNT > 15)
THEN
    DEST[127:0] <- 00000000000000000000000000000000H
ELSE
    DEST[15:0] <- ZeroExtend(SRC[15:0] >> COUNT);
    (* Repeat shift operation for 2nd through 7th words *)
    DEST[127:112] <- ZeroExtend(SRC[127:112] >> COUNT);
FI;

LOGICAL_RIGHT_SHIFT_DWORDS_256b(SRC, COUNT_SRC)
COUNT <- COUNT_SRC[63:0];
IF (COUNT > 31)
THEN
    DEST[255:0] <- 0
ELSE
    DEST[31:0] <- ZeroExtend(SRC[31:0] >> COUNT);
    (* Repeat shift operation for 2nd through 7th doublewords *)
    DEST[255:224] <- ZeroExtend(SRC[255:224] >> COUNT);
FI;

LOGICAL_RIGHT_SHIFT_DWORDS(SRC, COUNT_SRC)
COUNT <- COUNT_SRC[63:0];
IF (COUNT > 31)
THEN
    DEST[127:0] <- 00000000000000000000000000000000H
ELSE
    DEST[31:0] <- ZeroExtend(SRC[31:0] >> COUNT);
    (* Repeat shift operation for 2nd through 3rd doublewords *)
    DEST[127:96] <- ZeroExtend(SRC[127:96] >> COUNT);
FI;

LOGICAL_RIGHT_SHIFT_QWORDS_256b(SRC, COUNT_SRC)
COUNT <- COUNT_SRC[63:0];
IF (COUNT > 63)
THEN
    DEST[255:0] <- 0
ELSE
    DEST[63:0] <- ZeroExtend(SRC[63:0] >> COUNT);
    DEST[127:64] <- ZeroExtend(SRC[127:64] >> COUNT);
    DEST[191:128] <- ZeroExtend(SRC[191:128] >> COUNT);
    DEST[255:192] <- ZeroExtend(SRC[255:192] >> COUNT);
FI;

LOGICAL_RIGHT_SHIFT_QWORDS(SRC, COUNT_SRC)
COUNT <- COUNT_SRC[63:0];
IF (COUNT > 63)
THEN
    DEST[127:0] <- 00000000000000000000000000000000H
ELSE
    DEST[63:0] <- ZeroExtend(SRC[63:0] >> COUNT);
    DEST[127:64] <- ZeroExtend(SRC[127:64] >> COUNT);
FI;
VPSRLW (EVEX versions, xmm/m128)

(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL = 128
    TMP_DEST[127:0] <- LOGICAL_RIGHT_SHIFT_WORDS_128b(SRC1[127:0], SRC2)
FI;
IF VL = 256
    TMP_DEST[255:0] <- LOGICAL_RIGHT_SHIFT_WORDS_256b(SRC1[255:0], SRC2)
FI;
IF VL = 512
    TMP_DEST[255:0] <- LOGICAL_RIGHT_SHIFT_WORDS_256b(SRC1[255:0], SRC2)
    TMP_DEST[511:256] <- LOGICAL_RIGHT_SHIFT_WORDS_256b(SRC1[511:256], SRC2)
FI;
FOR j <- 0 TO KL-1
    i <- j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] <- TMP_DEST[i+15:i]
        ELSE
            IF *merging-masking*                ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking*          ; zeroing-masking
                    DEST[i+15:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0
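
The write-mask loop above distinguishes merging-masking from zeroing-masking. The following C sketch models only that final step for 16-bit elements; apply_writemask16 and its parameters are hypothetical names chosen for illustration, and the shifted values are assumed to be already available in tmp (the TMP_DEST of the pseudocode).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the EVEX write-mask step for 16-bit elements.
 *   dest:    destination contents before the instruction, updated in place
 *   tmp:     per-element results already computed (TMP_DEST)
 *   kl:      number of elements (8, 16, or 32)
 *   k1:      write mask, bit j governs element j
 *   zeroing: nonzero selects zeroing-masking, zero selects merging-masking */
static void apply_writemask16(uint16_t *dest, const uint16_t *tmp,
                              unsigned kl, uint32_t k1, int zeroing)
{
    assert(kl == 8 || kl == 16 || kl == 32);

    for (unsigned j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u)
            dest[j] = tmp[j];   /* mask bit set: take the shifted value */
        else if (zeroing)
            dest[j] = 0;        /* zeroing-masking clears the element */
        /* merging-masking: dest[j] remains unchanged */
    }
}
```

With a mask of 0x05, only elements 0 and 2 receive new values; the remaining elements either keep their old contents (merging) or become 0 (zeroing).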


VPSRLW (EVEX versions, imm8)

(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL = 128
    TMP_DEST[127:0] <- LOGICAL_RIGHT_SHIFT_WORDS_128b(SRC1[127:0], imm8)
FI;
IF VL = 256
    TMP_DEST[255:0] <- LOGICAL_RIGHT_SHIFT_WORDS_256b(SRC1[255:0], imm8)
FI;
IF VL = 512
    TMP_DEST[255:0] <- LOGICAL_RIGHT_SHIFT_WORDS_256b(SRC1[255:0], imm8)
    TMP_DEST[511:256] <- LOGICAL_RIGHT_SHIFT_WORDS_256b(SRC1[511:256], imm8)
FI;
FOR j <- 0 TO KL-1
    i <- j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] <- TMP_DEST[i+15:i]
        ELSE
            IF *merging-masking*                ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking*          ; zeroing-masking
                    DEST[i+15:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0
VPSRLW (ymm, ymm, xmm/m128) - VEX.256 encoding

DEST[255:0] <- LOGICAL_RIGHT_SHIFT_WORDS_256b(SRC1, SRC2)
DEST[MAX_VL-1:256] <- 0;


VPSRLW (ymm, imm8) - VEX.256 encoding

DEST[255:0] <- LOGICAL_RIGHT_SHIFT_WORDS_256b(SRC1, imm8)
DEST[MAX_VL-1:256] <- 0;


VPSRLW (xmm, xmm, xmm/m128) - VEX.128 encoding

DEST[127:0] <- LOGICAL_RIGHT_SHIFT_WORDS(SRC1, SRC2)
DEST[MAX_VL-1:128] <- 0


VPSRLW (xmm, imm8) - VEX.128 encoding

DEST[127:0] <- LOGICAL_RIGHT_SHIFT_WORDS(SRC1, imm8)
DEST[MAX_VL-1:128] <- 0


PSRLW (xmm, xmm, xmm/m128)

DEST[127:0] <- LOGICAL_RIGHT_SHIFT_WORDS(DEST, SRC)
DEST[MAX_VL-1:128] (Unmodified)


PSRLW (xmm, imm8)

DEST[127:0] <- LOGICAL_RIGHT_SHIFT_WORDS(DEST, imm8)
DEST[MAX_VL-1:128] (Unmodified)


VPSRLD (EVEX versions, xmm/m128)

(KL, VL) = (4, 128), (8, 256), (16, 512)
IF VL = 128
    TMP_DEST[127:0] <- LOGICAL_RIGHT_SHIFT_DWORDS_128b(SRC1[127:0], SRC2)
FI;
IF VL = 256
    TMP_DEST[255:0] <- LOGICAL_RIGHT_SHIFT_DWORDS_256b(SRC1[255:0], SRC2)
FI;
IF VL = 512
    TMP_DEST[255:0] <- LOGICAL_RIGHT_SHIFT_DWORDS_256b(SRC1[255:0], SRC2)
    TMP_DEST[511:256] <- LOGICAL_RIGHT_SHIFT_DWORDS_256b(SRC1[511:256], SRC2)
FI;
FOR j <- 0 TO KL-1
    i <- j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] <- TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking*                ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking*          ; zeroing-masking
                    DEST[i+31:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0
VPSRLD (EVEX versions, imm8)

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j <- 0 TO KL-1
    i <- j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC1 *is memory*)
            THEN DEST[i+31:i] <- LOGICAL_RIGHT_SHIFT_DWORDS1(SRC1[31:0], imm8)
            ELSE DEST[i+31:i] <- LOGICAL_RIGHT_SHIFT_DWORDS1(SRC1[i+31:i], imm8)
        FI;
    ELSE
        IF *merging-masking*                    ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE *zeroing-masking*              ; zeroing-masking
                DEST[i+31:i] <- 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0
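
The source selection in the immediate-count form (embedded broadcast when EVEX.b = 1 and the source is memory) can be sketched as follows. The helper names srl_dword1 and vpsrld_imm_model are hypothetical, the broadcast flag stands in for the EVEX.b condition, and write masking is left out to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of LOGICAL_RIGHT_SHIFT_DWORDS1: one doubleword, 0 when count > 31. */
static uint32_t srl_dword1(uint32_t src, uint8_t imm8)
{
    return imm8 > 31 ? 0u : src >> imm8;
}

/* Hypothetical model of the immediate-count VPSRLD source selection.
 * mem points at the memory source. When broadcast is nonzero (standing in
 * for EVEX.b = 1 with a memory source), only mem[0] is read and reused for
 * every element; otherwise element j comes from mem[j]. Masking omitted. */
static void vpsrld_imm_model(uint32_t *dest, const uint32_t *mem,
                             size_t kl, int broadcast, uint8_t imm8)
{
    for (size_t j = 0; j < kl; j++)
        dest[j] = srl_dword1(broadcast ? mem[0] : mem[j], imm8);
}
```

In the broadcast case every destination element holds the same shifted value, since a single 32-bit memory element feeds all lanes.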

VPSRLD (ymm, ymm, xmm/m128) - VEX.256 encoding

DEST[255:0] <- LOGICAL_RIGHT_SHIFT_DWORDS_256b(SRC1, SRC2)
DEST[MAX_VL-1:256] <- 0;

VPSRLD (ymm, imm8) - VEX.256 encoding

DEST[255:0] <- LOGICAL_RIGHT_SHIFT_DWORDS_256b(SRC1, imm8)
DEST[MAX_VL-1:256] <- 0;

VPSRLD (xmm, xmm, xmm/m128) - VEX.128 encoding

DEST[127:0] <- LOGICAL_RIGHT_SHIFT_DWORDS(SRC1, SRC2)
DEST[MAX_VL-1:128] <- 0

VPSRLD (xmm, imm8) - VEX.128 encoding

DEST[127:0] <- LOGICAL_RIGHT_SHIFT_DWORDS(SRC1, imm8)
DEST[MAX_VL-1:128] <- 0

PSRLD (xmm, xmm, xmm/m128)

DEST[127:0] <- LOGICAL_RIGHT_SHIFT_DWORDS(DEST, SRC)
DEST[MAX_VL-1:128] (Unmodified)

PSRLD (xmm, imm8)

DEST[127:0] <- LOGICAL_RIGHT_SHIFT_DWORDS(DEST, imm8)
DEST[MAX_VL-1:128] (Unmodified)
VPSRLQ (EVEX versions, xmm/m128)

(KL, VL) = (2, 128), (4, 256), (8, 512)
IF VL = 128
    TMP_DEST[127:0] <- LOGICAL_RIGHT_SHIFT_QWORDS_128b(SRC1[127:0], SRC2)
FI;
IF VL = 256
    TMP_DEST[255:0] <- LOGICAL_RIGHT_SHIFT_QWORDS_256b(SRC1[255:0], SRC2)
FI;
IF VL = 512
    TMP_DEST[255:0] <- LOGICAL_RIGHT_SHIFT_QWORDS_256b(SRC1[255:0], SRC2)
    TMP_DEST[511:256] <- LOGICAL_RIGHT_SHIFT_QWORDS_256b(SRC1[511:256], SRC2)
FI;
FOR j <- 0 TO KL-1
    i <- j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] <- TMP_DEST[i+63:i]
        ELSE
            IF *merging-masking*                ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE *zeroing-masking*          ; zeroing-masking
                    DEST[i+63:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0


VPSRLQ (EVEX versions, imm8)

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j <- 0 TO KL-1
    i <- j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC1 *is memory*)
            THEN DEST[i+63:i] <- LOGICAL_RIGHT_SHIFT_QWORDS1(SRC1[63:0], imm8)
            ELSE DEST[i+63:i] <- LOGICAL_RIGHT_SHIFT_QWORDS1(SRC1[i+63:i], imm8)
        FI;
    ELSE
        IF *merging-masking*                    ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE *zeroing-masking*              ; zeroing-masking
                DEST[i+63:i] <- 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0


VPSRLQ (ymm, ymm, xmm/m128) - VEX.256 encoding

DEST[255:0] <- LOGICAL_RIGHT_SHIFT_QWORDS_256b(SRC1, SRC2)
DEST[MAX_VL-1:256] <- 0;


VPSRLQ (ymm, imm8) - VEX.256 encoding

DEST[255:0] <- LOGICAL_RIGHT_SHIFT_QWORDS_256b(SRC1, imm8)
DEST[MAX_VL-1:256] <- 0;

VPSRLQ (xmm, xmm, xmm/m128) - VEX.128 encoding

DEST[127:0] <- LOGICAL_RIGHT_SHIFT_QWORDS(SRC1, SRC2)
DEST[MAX_VL-1:128] <- 0


VPSRLQ (xmm, imm8) - VEX.128 encoding

DEST[127:0] <- LOGICAL_RIGHT_SHIFT_QWORDS(SRC1, imm8)
DEST[MAX_VL-1:128] <- 0


PSRLQ (xmm, xmm, xmm/m128)

DEST[127:0] <- LOGICAL_RIGHT_SHIFT_QWORDS(DEST, SRC)
DEST[MAX_VL-1:128] (Unmodified)


PSRLQ (xmm, imm8)

DEST[127:0] <- LOGICAL_RIGHT_SHIFT_QWORDS(DEST, imm8)
DEST[MAX_VL-1:128] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalents 

VPSRLD __m512i _mm512_srli_epi32(__m512i a, unsigned int imm);

VPSRLD __m512i _mm512_mask_srli_epi32(__m512i s, __mmask16 k, __m512i a, unsigned int imm);

VPSRLD __m512i _mm512_maskz_srli_epi32(__mmask16 k, __m512i a, unsigned int imm);

VPSRLD __m256i _mm256_mask_srli_epi32(__m256i s, __mmask8 k, __m256i a, unsigned int imm);

VPSRLD __m256i _mm256_maskz_srli_epi32(__mmask8 k, __m256i a, unsigned int imm);

VPSRLD __m128i _mm_mask_srli_epi32(__m128i s, __mmask8 k, __m128i a, unsigned int imm);

VPSRLD __m128i _mm_maskz_srli_epi32(__mmask8 k, __m128i a, unsigned int imm);

VPSRLD __m512i _mm512_srl_epi32(__m512i a, __m128i cnt);

VPSRLD __m512i _mm512_mask_srl_epi32(__m512i s, __mmask16 k, __m512i a, __m128i cnt);

VPSRLD __m512i _mm512_maskz_srl_epi32(__mmask16 k, __m512i a, __m128i cnt);

VPSRLD __m256i _mm256_mask_srl_epi32(__m256i s, __mmask8 k, __m256i a, __m128i cnt);

VPSRLD __m256i _mm256_maskz_srl_epi32(__mmask8 k, __m256i a, __m128i cnt);

VPSRLD __m128i _mm_mask_srl_epi32(__m128i s, __mmask8 k, __m128i a, __m128i cnt);

VPSRLD __m128i _mm_maskz_srl_epi32(__mmask8 k, __m128i a, __m128i cnt);

VPSRLQ __m512i _mm512_srli_epi64(__m512i a, unsigned int imm);

VPSRLQ __m512i _mm512_mask_srli_epi64(__m512i s, __mmask8 k, __m512i a, unsigned int imm);

VPSRLQ __m512i _mm512_maskz_srli_epi64(__mmask8 k, __m512i a, unsigned int imm);

VPSRLQ __m256i _mm256_mask_srli_epi64(__m256i s, __mmask8 k, __m256i a, unsigned int imm);

VPSRLQ __m256i _mm256_maskz_srli_epi64(__mmask8 k, __m256i a, unsigned int imm);

VPSRLQ __m128i _mm_mask_srli_epi64(__m128i s, __mmask8 k, __m128i a, unsigned int imm);

VPSRLQ __m128i _mm_maskz_srli_epi64(__mmask8 k, __m128i a, unsigned int imm);

VPSRLQ __m512i _mm512_srl_epi64(__m512i a, __m128i cnt);

VPSRLQ __m512i _mm512_mask_srl_epi64(__m512i s, __mmask8 k, __m512i a, __m128i cnt);

VPSRLQ __m512i _mm512_maskz_srl_epi64(__mmask8 k, __m512i a, __m128i cnt);

VPSRLQ __m256i _mm256_mask_srl_epi64(__m256i s, __mmask8 k, __m256i a, __m128i cnt);

VPSRLQ __m256i _mm256_maskz_srl_epi64(__mmask8 k, __m256i a, __m128i cnt);

VPSRLQ __m128i _mm_mask_srl_epi64(__m128i s, __mmask8 k, __m128i a, __m128i cnt);

VPSRLQ __m128i _mm_maskz_srl_epi64(__mmask8 k, __m128i a, __m128i cnt);

VPSRLW __m512i _mm512_srli_epi16(__m512i a, unsigned int imm);

VPSRLW __m512i _mm512_mask_srli_epi16(__m512i s, __mmask32 k, __m512i a, unsigned int imm);

VPSRLW __m512i _mm512_maskz_srli_epi16(__mmask32 k, __m512i a, unsigned int imm);

VPSRLW __m256i _mm256_mask_srli_epi16(__m256i s, __mmask16 k, __m256i a, unsigned int imm);

VPSRLW __m256i _mm256_maskz_srli_epi16(__mmask16 k, __m256i a, unsigned int imm);

VPSRLW __m128i _mm_mask_srli_epi16(__m128i s, __mmask8 k, __m128i a, unsigned int imm);

VPSRLW __m128i _mm_maskz_srli_epi16(__mmask8 k, __m128i a, unsigned int imm);

VPSRLW __m512i _mm512_srl_epi16(__m512i a, __m128i cnt);

VPSRLW __m512i _mm512_mask_srl_epi16(__m512i s, __mmask32 k, __m512i a, __m128i cnt);

VPSRLW __m512i _mm512_maskz_srl_epi16(__mmask32 k, __m512i a, __m128i cnt);

VPSRLW __m256i _mm256_mask_srl_epi16(__m256i s, __mmask16 k, __m256i a, __m128i cnt);

VPSRLW __m256i _mm256_maskz_srl_epi16(__mmask16 k, __m256i a, __m128i cnt);

VPSRLW __m128i _mm_mask_srl_epi16(__m128i s, __mmask8 k, __m128i a, __m128i cnt);

VPSRLW __m128i _mm_maskz_srl_epi16(__mmask8 k, __m128i a, __m128i cnt);

PSRLW: __m64 _mm_srli_pi16 (__m64 m, int count)

PSRLW: __m64 _mm_srl_pi16 (__m64 m, __m64 count)

(V)PSRLW: __m128i _mm_srli_epi16 (__m128i m, int count)

(V)PSRLW: __m128i _mm_srl_epi16 (__m128i m, __m128i count)

VPSRLW: __m256i _mm256_srli_epi16 (__m256i m, int count)

VPSRLW: __m256i _mm256_srl_epi16 (__m256i m, __m128i count)

PSRLD: __m64 _mm_srli_pi32 (__m64 m, int count)

PSRLD: __m64 _mm_srl_pi32 (__m64 m, __m64 count)

(V)PSRLD: __m128i _mm_srli_epi32 (__m128i m, int count)

(V)PSRLD: __m128i _mm_srl_epi32 (__m128i m, __m128i count)

VPSRLD: __m256i _mm256_srli_epi32 (__m256i m, int count)

VPSRLD: __m256i _mm256_srl_epi32 (__m256i m, __m128i count)

PSRLQ: __m64 _mm_srli_si64 (__m64 m, int count)

PSRLQ: __m64 _mm_srl_si64 (__m64 m, __m64 count)

(V)PSRLQ: __m128i _mm_srli_epi64 (__m128i m, int count)

(V)PSRLQ: __m128i _mm_srl_epi64 (__m128i m, __m128i count)

VPSRLQ: __m256i _mm256_srli_epi64 (__m256i m, int count)

VPSRLQ: __m256i _mm256_srl_epi64 (__m256i m, __m128i count)

Flags Affected 

None. 

Numeric Exceptions 

None. 

Other Exceptions 

VEX-encoded instructions: 

Syntax with RM/RVM operand encoding, see Exceptions Type 4. 
Syntax with MI/VMI operand encoding, see Exceptions Type 7. 

EVEX-encoded VPSRLW, see Exceptions Type E4NF.nb. 

EVEX-encoded VPSRLD/Q: 

Syntax with M128 operand encoding, see Exceptions Type E4NF.nb. 
Syntax with FVI operand encoding, see Exceptions Type E4. 
PSUBB/PSUBW/PSUBD-Subtract Packed Integers


Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F F8 /r¹ | PSUBB mm, mm/m64 | RM | V/V | MMX | Subtract packed byte integers in mm/m64 from packed byte integers in mm.
66 0F F8 /r | PSUBB xmm1, xmm2/m128 | RM | V/V | SSE2 | Subtract packed byte integers in xmm2/m128 from packed byte integers in xmm1.
0F F9 /r¹ | PSUBW mm, mm/m64 | RM | V/V | MMX | Subtract packed word integers in mm/m64 from packed word integers in mm.
66 0F F9 /r | PSUBW xmm1, xmm2/m128 | RM | V/V | SSE2 | Subtract packed word integers in xmm2/m128 from packed word integers in xmm1.
0F FA /r¹ | PSUBD mm, mm/m64 | RM | V/V | MMX | Subtract packed doubleword integers in mm/m64 from packed doubleword integers in mm.
66 0F FA /r | PSUBD xmm1, xmm2/m128 | RM | V/V | SSE2 | Subtract packed doubleword integers in xmm2/mem128 from packed doubleword integers in xmm1.
VEX.NDS.128.66.0F.WIG F8 /r | VPSUBB xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Subtract packed byte integers in xmm3/m128 from xmm2.
VEX.NDS.128.66.0F.WIG F9 /r | VPSUBW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Subtract packed word integers in xmm3/m128 from xmm2.
VEX.NDS.128.66.0F.WIG FA /r | VPSUBD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Subtract packed doubleword integers in xmm3/m128 from xmm2.
VEX.NDS.256.66.0F.WIG F8 /r | VPSUBB ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Subtract packed byte integers in ymm3/m256 from ymm2.
VEX.NDS.256.66.0F.WIG F9 /r | VPSUBW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Subtract packed word integers in ymm3/m256 from ymm2.
VEX.NDS.256.66.0F.WIG FA /r | VPSUBD ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Subtract packed doubleword integers in ymm3/m256 from ymm2.
EVEX.NDS.128.66.0F.WIG F8 /r | VPSUBB xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Subtract packed byte integers in xmm3/m128 from xmm2 and store in xmm1 using writemask k1.
EVEX.NDS.256.66.0F.WIG F8 /r | VPSUBB ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Subtract packed byte integers in ymm3/m256 from ymm2 and store in ymm1 using writemask k1.
EVEX.NDS.512.66.0F.WIG F8 /r | VPSUBB zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Subtract packed byte integers in zmm3/m512 from zmm2 and store in zmm1 using writemask k1.
EVEX.NDS.128.66.0F.WIG F9 /r | VPSUBW xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Subtract packed word integers in xmm3/m128 from xmm2 and store in xmm1 using writemask k1.
EVEX.NDS.256.66.0F.WIG F9 /r | VPSUBW ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Subtract packed word integers in ymm3/m256 from ymm2 and store in ymm1 using writemask k1.
EVEX.NDS.512.66.0F.WIG F9 /r | VPSUBW zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Subtract packed word integers in zmm3/m512 from zmm2 and store in zmm1 using writemask k1.
EVEX.NDS.128.66.0F.W0 FA /r | VPSUBD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Subtract packed doubleword integers in xmm3/m128/m32bcst from xmm2 and store in xmm1 using writemask k1.
EVEX.NDS.256.66.0F.W0 FA /r | VPSUBD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Subtract packed doubleword integers in ymm3/m256/m32bcst from ymm2 and store in ymm1 using writemask k1.
EVEX.NDS.512.66.0F.W0 FA /r | VPSUBD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | FV | V/V | AVX512F | Subtract packed doubleword integers in zmm3/m512/m32bcst from zmm2 and store in zmm1 using writemask k1.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers" in 
the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. 


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Performs a SIMD subtract of the packed integers of the source operand (second operand) from the packed integers 
of the destination operand (first operand), and stores the packed integer results in the destination operand. See 
Figure 9-4 in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1, for an illustration of 
a SIMD operation. Overflow is handled with wraparound, as described in the following paragraphs. 

The (V)PSUBB instruction subtracts packed byte integers. When an individual result is too large or too small to be 
represented in a byte, the result is wrapped around and the low 8 bits are written to the destination element. 

The (V)PSUBW instruction subtracts packed word integers. When an individual result is too large or too small to be 
represented in a word, the result is wrapped around and the low 16 bits are written to the destination element. 

The (V)PSUBD instruction subtracts packed doubleword integers. When an individual result is too large or too small 
to be represented in a doubleword, the result is wrapped around and the low 32 bits are written to the destination 
element. 

Note that the (V)PSUBB, (V)PSUBW, and (V)PSUBD instructions can operate on either unsigned or signed (two's 
complement notation) packed integers; however, they do not set bits in the EFLAGS register to indicate overflow 
and/or a carry. To prevent undetected overflow conditions, software must control the ranges of values upon which 
it operates. 
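
The wraparound behavior described above can be sketched in scalar C. This is only an illustrative model of the per-element arithmetic, not the instruction itself; the function names psubb_model and psubb_would_borrow are hypothetical and do not come from this manual. Because no flag reports overflow, the second helper shows one way software might check unsigned operands before subtracting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of PSUBB: dst[i] = (dst[i] - src[i]) modulo 2^8. */
static void psubb_model(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* The cast keeps only the low 8 bits, which is the wraparound. */
        dst[i] = (uint8_t)(dst[i] - src[i]);
    }
}

/* Returns 1 if any unsigned element would borrow (dst[i] < src[i]), else 0.
 * The instruction gives no such indication; software has to check. */
static int psubb_would_borrow(const uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (dst[i] < src[i])
            return 1;
    }
    return 0;
}
```

The same pattern applies to word and doubleword elements with uint16_t and uint32_t in place of uint8_t.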

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15). 

Legacy SSE version 64-bit operand: The destination operand must be an MMX technology register and the source 
operand can be either an MMX technology register or a 64-bit memory location. 

128-bit Legacy SSE version: The second source operand is an XMM register or a 128-bit memory location. The first 
source operand and destination operands are XMM registers. Bits (VLMAX-1:128) of the corresponding YMM 
destination register remain unchanged. 

VEX.128 encoded version: The second source operand is an XMM register or a 128-bit memory location. The first 
source operand and destination operands are XMM registers. Bits (VLMAX-1:128) of the destination YMM register 
are zeroed. 



VEX.256 encoded versions: The second source operand is a YMM register or a 256-bit memory location. The first 
source operand and destination operands are YMM registers. Bits (MAX_VL-1:256) of the corresponding ZMM 
register are zeroed. 

EVEX encoded VPSUBD: The second source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory 
location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The first source operand and 
destination operands are ZMM/YMM/XMM registers. The destination is conditionally updated with writemask k1. 

EVEX encoded VPSUBB/W: The second source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory 
location. The first source operand and destination operands are ZMM/YMM/XMM registers. The destination is 
conditionally updated with writemask k1. 

Operation 

PSUBB (with 64-bit operands) 
DEST[7:0] ← DEST[7:0] - SRC[7:0]; 
(* Repeat subtract operation for 2nd through 7th byte *) 
DEST[63:56] ← DEST[63:56] - SRC[63:56]; 

PSUBW (with 64-bit operands) 
DEST[15:0] ← DEST[15:0] - SRC[15:0]; 
(* Repeat subtract operation for 2nd and 3rd word *) 
DEST[63:48] ← DEST[63:48] - SRC[63:48]; 

PSUBD (with 64-bit operands) 
DEST[31:0] ← DEST[31:0] - SRC[31:0]; 
DEST[63:32] ← DEST[63:32] - SRC[63:32]; 

PSUBD (with 128-bit operands) 
DEST[31:0] ← DEST[31:0] - SRC[31:0]; 
(* Repeat subtract operation for 2nd and 3rd doubleword *) 
DEST[127:96] ← DEST[127:96] - SRC[127:96]; 

VPSUBB (EVEX encoded versions) 
(KL, VL) = (16, 128), (32, 256), (64, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 8 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+7:i] ← SRC1[i+7:i] - SRC2[i+7:i] 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+7:i] remains unchanged* 
                ELSE *zeroing-masking* ; zeroing-masking 
                    DEST[i+7:i] ← 0 
            FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 

VPSUBW (EVEX encoded versions) 
(KL, VL) = (8, 128), (16, 256), (32, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 16 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+15:i] ← SRC1[i+15:i] - SRC2[i+15:i] 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+15:i] remains unchanged* 
                ELSE *zeroing-masking* ; zeroing-masking 
                    DEST[i+15:i] ← 0 
            FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 

VPSUBD (EVEX encoded versions) 
(KL, VL) = (4, 128), (8, 256), (16, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 32 
    IF k1[j] OR *no writemask* THEN 
        IF (EVEX.b = 1) AND (SRC2 *is memory*) 
            THEN DEST[i+31:i] ← SRC1[i+31:i] - SRC2[31:0] 
            ELSE DEST[i+31:i] ← SRC1[i+31:i] - SRC2[i+31:i] 
        FI; 
    ELSE 
        IF *merging-masking* ; merging-masking 
            THEN *DEST[i+31:i] remains unchanged* 
            ELSE *zeroing-masking* ; zeroing-masking 
                DEST[i+31:i] ← 0 
        FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 
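
As a rough illustration of the EVEX VPSUBD pseudocode above, the following scalar C sketch models the writemask and broadcast behavior lane by lane. The function name vpsubd_evex_model and its parameters are hypothetical, chosen only for this example; the sketch does not model the zeroing of bits MAX_VL-1:VL of the destination register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of EVEX-encoded VPSUBD on kl doubleword lanes.
 *   k       : writemask bits, bit j controls lane j
 *   zeroing : nonzero selects zeroing-masking, zero selects merging-masking
 *   bcast   : nonzero models EVEX.b = 1 with a memory source, so src2[0]
 *             is used for every lane */
static void vpsubd_evex_model(uint32_t *dest, const uint32_t *src1,
                              const uint32_t *src2, int kl,
                              uint16_t k, int zeroing, int bcast)
{
    for (int j = 0; j < kl; j++) {
        if ((k >> j) & 1u) {
            uint32_t s2 = bcast ? src2[0] : src2[j];
            dest[j] = src1[j] - s2;  /* unsigned arithmetic wraps modulo 2^32 */
        } else if (zeroing) {
            dest[j] = 0;             /* zeroing-masking */
        }
        /* merging-masking: dest[j] is left unchanged */
    }
}
```

Passing kl = 4, 8, or 16 corresponds to the 128-, 256-, and 512-bit vector lengths listed in the (KL, VL) line.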


VPSUBB (VEX.256 encoded version) 
DEST[7:0] ← SRC1[7:0]-SRC2[7:0] 
DEST[15:8] ← SRC1[15:8]-SRC2[15:8] 
DEST[23:16] ← SRC1[23:16]-SRC2[23:16] 
DEST[31:24] ← SRC1[31:24]-SRC2[31:24] 
DEST[39:32] ← SRC1[39:32]-SRC2[39:32] 
DEST[47:40] ← SRC1[47:40]-SRC2[47:40] 
DEST[55:48] ← SRC1[55:48]-SRC2[55:48] 
DEST[63:56] ← SRC1[63:56]-SRC2[63:56] 
DEST[71:64] ← SRC1[71:64]-SRC2[71:64] 
DEST[79:72] ← SRC1[79:72]-SRC2[79:72] 
DEST[87:80] ← SRC1[87:80]-SRC2[87:80] 
DEST[95:88] ← SRC1[95:88]-SRC2[95:88] 
DEST[103:96] ← SRC1[103:96]-SRC2[103:96] 
DEST[111:104] ← SRC1[111:104]-SRC2[111:104] 
DEST[119:112] ← SRC1[119:112]-SRC2[119:112] 
DEST[127:120] ← SRC1[127:120]-SRC2[127:120] 
DEST[135:128] ← SRC1[135:128]-SRC2[135:128] 
DEST[143:136] ← SRC1[143:136]-SRC2[143:136] 
DEST[151:144] ← SRC1[151:144]-SRC2[151:144] 
DEST[159:152] ← SRC1[159:152]-SRC2[159:152] 
DEST[167:160] ← SRC1[167:160]-SRC2[167:160] 
DEST[175:168] ← SRC1[175:168]-SRC2[175:168] 
DEST[183:176] ← SRC1[183:176]-SRC2[183:176] 
DEST[191:184] ← SRC1[191:184]-SRC2[191:184] 
DEST[199:192] ← SRC1[199:192]-SRC2[199:192] 
DEST[207:200] ← SRC1[207:200]-SRC2[207:200] 
DEST[215:208] ← SRC1[215:208]-SRC2[215:208] 
DEST[223:216] ← SRC1[223:216]-SRC2[223:216] 
DEST[231:224] ← SRC1[231:224]-SRC2[231:224] 
DEST[239:232] ← SRC1[239:232]-SRC2[239:232] 
DEST[247:240] ← SRC1[247:240]-SRC2[247:240] 
DEST[255:248] ← SRC1[255:248]-SRC2[255:248] 
DEST[MAX_VL-1:256] ← 0 

VPSUBB (VEX.128 encoded version) 
DEST[7:0] ← SRC1[7:0]-SRC2[7:0] 
DEST[15:8] ← SRC1[15:8]-SRC2[15:8] 
DEST[23:16] ← SRC1[23:16]-SRC2[23:16] 
DEST[31:24] ← SRC1[31:24]-SRC2[31:24] 
DEST[39:32] ← SRC1[39:32]-SRC2[39:32] 
DEST[47:40] ← SRC1[47:40]-SRC2[47:40] 
DEST[55:48] ← SRC1[55:48]-SRC2[55:48] 
DEST[63:56] ← SRC1[63:56]-SRC2[63:56] 
DEST[71:64] ← SRC1[71:64]-SRC2[71:64] 
DEST[79:72] ← SRC1[79:72]-SRC2[79:72] 
DEST[87:80] ← SRC1[87:80]-SRC2[87:80] 
DEST[95:88] ← SRC1[95:88]-SRC2[95:88] 
DEST[103:96] ← SRC1[103:96]-SRC2[103:96] 
DEST[111:104] ← SRC1[111:104]-SRC2[111:104] 
DEST[119:112] ← SRC1[119:112]-SRC2[119:112] 
DEST[127:120] ← SRC1[127:120]-SRC2[127:120] 
DEST[MAX_VL-1:128] ← 0 

PSUBB (128-bit Legacy SSE version) 
DEST[7:0] ← DEST[7:0]-SRC[7:0] 
DEST[15:8] ← DEST[15:8]-SRC[15:8] 
DEST[23:16] ← DEST[23:16]-SRC[23:16] 
DEST[31:24] ← DEST[31:24]-SRC[31:24] 
DEST[39:32] ← DEST[39:32]-SRC[39:32] 
DEST[47:40] ← DEST[47:40]-SRC[47:40] 
DEST[55:48] ← DEST[55:48]-SRC[55:48] 
DEST[63:56] ← DEST[63:56]-SRC[63:56] 
DEST[71:64] ← DEST[71:64]-SRC[71:64] 
DEST[79:72] ← DEST[79:72]-SRC[79:72] 
DEST[87:80] ← DEST[87:80]-SRC[87:80] 
DEST[95:88] ← DEST[95:88]-SRC[95:88] 
DEST[103:96] ← DEST[103:96]-SRC[103:96] 
DEST[111:104] ← DEST[111:104]-SRC[111:104] 
DEST[119:112] ← DEST[119:112]-SRC[119:112] 
DEST[127:120] ← DEST[127:120]-SRC[127:120] 
DEST[MAX_VL-1:128] (Unmodified) 

VPSUBW (VEX.256 encoded version) 
DEST[15:0] ← SRC1[15:0]-SRC2[15:0] 
DEST[31:16] ← SRC1[31:16]-SRC2[31:16] 
DEST[47:32] ← SRC1[47:32]-SRC2[47:32] 
DEST[63:48] ← SRC1[63:48]-SRC2[63:48] 
DEST[79:64] ← SRC1[79:64]-SRC2[79:64] 
DEST[95:80] ← SRC1[95:80]-SRC2[95:80] 
DEST[111:96] ← SRC1[111:96]-SRC2[111:96] 
DEST[127:112] ← SRC1[127:112]-SRC2[127:112] 
DEST[143:128] ← SRC1[143:128]-SRC2[143:128] 
DEST[159:144] ← SRC1[159:144]-SRC2[159:144] 
DEST[175:160] ← SRC1[175:160]-SRC2[175:160] 
DEST[191:176] ← SRC1[191:176]-SRC2[191:176] 
DEST[207:192] ← SRC1[207:192]-SRC2[207:192] 
DEST[223:208] ← SRC1[223:208]-SRC2[223:208] 
DEST[239:224] ← SRC1[239:224]-SRC2[239:224] 
DEST[255:240] ← SRC1[255:240]-SRC2[255:240] 
DEST[MAX_VL-1:256] ← 0 

VPSUBW (VEX.128 encoded version) 
DEST[15:0] ← SRC1[15:0]-SRC2[15:0] 
DEST[31:16] ← SRC1[31:16]-SRC2[31:16] 
DEST[47:32] ← SRC1[47:32]-SRC2[47:32] 
DEST[63:48] ← SRC1[63:48]-SRC2[63:48] 
DEST[79:64] ← SRC1[79:64]-SRC2[79:64] 
DEST[95:80] ← SRC1[95:80]-SRC2[95:80] 
DEST[111:96] ← SRC1[111:96]-SRC2[111:96] 
DEST[127:112] ← SRC1[127:112]-SRC2[127:112] 
DEST[MAX_VL-1:128] ← 0 

PSUBW (128-bit Legacy SSE version) 
DEST[15:0] ← DEST[15:0]-SRC[15:0] 
DEST[31:16] ← DEST[31:16]-SRC[31:16] 
DEST[47:32] ← DEST[47:32]-SRC[47:32] 
DEST[63:48] ← DEST[63:48]-SRC[63:48] 
DEST[79:64] ← DEST[79:64]-SRC[79:64] 
DEST[95:80] ← DEST[95:80]-SRC[95:80] 
DEST[111:96] ← DEST[111:96]-SRC[111:96] 
DEST[127:112] ← DEST[127:112]-SRC[127:112] 
DEST[MAX_VL-1:128] (Unmodified) 


VPSUBD (VEX.256 encoded version) 
DEST[31:0] ← SRC1[31:0]-SRC2[31:0] 
DEST[63:32] ← SRC1[63:32]-SRC2[63:32] 
DEST[95:64] ← SRC1[95:64]-SRC2[95:64] 
DEST[127:96] ← SRC1[127:96]-SRC2[127:96] 
DEST[159:128] ← SRC1[159:128]-SRC2[159:128] 
DEST[191:160] ← SRC1[191:160]-SRC2[191:160] 
DEST[223:192] ← SRC1[223:192]-SRC2[223:192] 
DEST[255:224] ← SRC1[255:224]-SRC2[255:224] 
DEST[MAX_VL-1:256] ← 0 

VPSUBD (VEX.128 encoded version) 
DEST[31:0] ← SRC1[31:0]-SRC2[31:0] 
DEST[63:32] ← SRC1[63:32]-SRC2[63:32] 
DEST[95:64] ← SRC1[95:64]-SRC2[95:64] 
DEST[127:96] ← SRC1[127:96]-SRC2[127:96] 
DEST[MAX_VL-1:128] ← 0 

PSUBD (128-bit Legacy SSE version) 
DEST[31:0] ← DEST[31:0]-SRC[31:0] 
DEST[63:32] ← DEST[63:32]-SRC[63:32] 
DEST[95:64] ← DEST[95:64]-SRC[95:64] 
DEST[127:96] ← DEST[127:96]-SRC[127:96] 
DEST[MAX_VL-1:128] (Unmodified) 

Intel C/C++ Compiler Intrinsic Equivalents 

VPSUBB __m512i _mm512_sub_epi8(__m512i a, __m512i b); 
VPSUBB __m512i _mm512_mask_sub_epi8(__m512i s, __mmask64 k, __m512i a, __m512i b); 
VPSUBB __m512i _mm512_maskz_sub_epi8(__mmask64 k, __m512i a, __m512i b); 
VPSUBB __m256i _mm256_mask_sub_epi8(__m256i s, __mmask32 k, __m256i a, __m256i b); 
VPSUBB __m256i _mm256_maskz_sub_epi8(__mmask32 k, __m256i a, __m256i b); 
VPSUBB __m128i _mm_mask_sub_epi8(__m128i s, __mmask16 k, __m128i a, __m128i b); 
VPSUBB __m128i _mm_maskz_sub_epi8(__mmask16 k, __m128i a, __m128i b); 
VPSUBW __m512i _mm512_sub_epi16(__m512i a, __m512i b); 
VPSUBW __m512i _mm512_mask_sub_epi16(__m512i s, __mmask32 k, __m512i a, __m512i b); 
VPSUBW __m512i _mm512_maskz_sub_epi16(__mmask32 k, __m512i a, __m512i b); 
VPSUBW __m256i _mm256_mask_sub_epi16(__m256i s, __mmask16 k, __m256i a, __m256i b); 
VPSUBW __m256i _mm256_maskz_sub_epi16(__mmask16 k, __m256i a, __m256i b); 
VPSUBW __m128i _mm_mask_sub_epi16(__m128i s, __mmask8 k, __m128i a, __m128i b); 
VPSUBW __m128i _mm_maskz_sub_epi16(__mmask8 k, __m128i a, __m128i b); 
VPSUBD __m512i _mm512_sub_epi32(__m512i a, __m512i b); 
VPSUBD __m512i _mm512_mask_sub_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b); 
VPSUBD __m512i _mm512_maskz_sub_epi32(__mmask16 k, __m512i a, __m512i b); 
VPSUBD __m256i _mm256_mask_sub_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b); 
VPSUBD __m256i _mm256_maskz_sub_epi32(__mmask8 k, __m256i a, __m256i b); 
VPSUBD __m128i _mm_mask_sub_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b); 
VPSUBD __m128i _mm_maskz_sub_epi32(__mmask8 k, __m128i a, __m128i b); 
PSUBB: __m64 _mm_sub_pi8(__m64 m1, __m64 m2) 
(V)PSUBB: __m128i _mm_sub_epi8(__m128i a, __m128i b) 
VPSUBB: __m256i _mm256_sub_epi8(__m256i a, __m256i b) 
PSUBW: __m64 _mm_sub_pi16(__m64 m1, __m64 m2) 
(V)PSUBW: __m128i _mm_sub_epi16(__m128i a, __m128i b) 
VPSUBW: __m256i _mm256_sub_epi16(__m256i a, __m256i b) 
PSUBD: __m64 _mm_sub_pi32(__m64 m1, __m64 m2) 
(V)PSUBD: __m128i _mm_sub_epi32(__m128i a, __m128i b) 
VPSUBD: __m256i _mm256_sub_epi32(__m256i a, __m256i b) 

Flags Affected 

None. 

Numeric Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded VPSUBD, see Exceptions Type E4. 

EVEX-encoded VPSUBB/W, see Exceptions Type E4.nb. 

PSUBQ—Subtract Packed Quadword Integers 


Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F FB /r¹ PSUBQ mm1, mm2/m64 | RM | V/V | SSE2 | Subtract quadword integer in mm2/m64 from mm1.
66 0F FB /r PSUBQ xmm1, xmm2/m128 | RM | V/V | SSE2 | Subtract packed quadword integers in xmm2/m128 from xmm1.
VEX.NDS.128.66.0F.WIG FB /r VPSUBQ xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Subtract packed quadword integers in xmm3/m128 from xmm2.
VEX.NDS.256.66.0F.WIG FB /r VPSUBQ ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Subtract packed quadword integers in ymm3/m256 from ymm2.
EVEX.NDS.128.66.0F.W1 FB /r VPSUBQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Subtract packed quadword integers in xmm3/m128/m64bcst from xmm2 and store in xmm1 using writemask k1.
EVEX.NDS.256.66.0F.W1 FB /r VPSUBQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Subtract packed quadword integers in ymm3/m256/m64bcst from ymm2 and store in ymm1 using writemask k1.
EVEX.NDS.512.66.0F.W1 FB /r VPSUBQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | FV | V/V | AVX512F | Subtract packed quadword integers in zmm3/m512/m64bcst from zmm2 and store in zmm1 using writemask k1.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers" 
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. 


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Subtracts the second operand (source operand) from the first operand (destination operand) and stores the result 
in the destination operand. When packed quadword operands are used, a SIMD subtract is performed. When a 
quadword result is too large to be represented in 64 bits (overflow), the result is wrapped around and the low 64 
bits are written to the destination element (that is, the carry is ignored). 

Note that the (V)PSUBQ instruction can operate on either unsigned or signed (two's complement notation) 
integers; however, it does not set bits in the EFLAGS register to indicate overflow and/or a carry. To prevent undetected 
overflow conditions, software must control the ranges of the values upon which it operates. 
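
Because (V)PSUBQ reports neither carry nor overflow, software that needs those conditions has to derive them itself. The following scalar C sketch shows one conventional way to do so for a single quadword lane; the function names are hypothetical and the code models the arithmetic only, not the instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one PSUBQ lane: returns (a - b) modulo 2^64 and
 * reports through *borrow whether the unsigned subtraction wrapped. */
static uint64_t psubq_lane_model(uint64_t a, uint64_t b, int *borrow)
{
    *borrow = (a < b);
    return a - b;
}

/* Returns 1 if a - b overflows when both are read as signed 64-bit values.
 * Overflow requires operands of different sign and a result whose sign
 * differs from that of a. */
static int psubq_signed_overflow(uint64_t a, uint64_t b)
{
    uint64_t r = a - b;
    return (int)(((a ^ b) & (a ^ r)) >> 63);
}
```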

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15). 

Legacy SSE version 64-bit operand: The source operand can be a quadword integer stored in an MMX technology 
register or a 64-bit memory location. 

128-bit Legacy SSE version: The second source operand is an XMM register or a 128-bit memory location. The first 
source operand and destination operands are XMM registers. Bits (VLMAX-1:128) of the corresponding YMM 
destination register remain unchanged. 

VEX.128 encoded version: The second source operand is an XMM register or a 128-bit memory location. The first 
source operand and destination operands are XMM registers. Bits (VLMAX-1:128) of the destination YMM register 
are zeroed. 

VEX.256 encoded versions: The second source operand is a YMM register or a 256-bit memory location. The first 
source operand and destination operands are YMM registers. Bits (MAX_VL-1:256) of the corresponding ZMM 
register are zeroed. 

EVEX encoded VPSUBQ: The second source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory 
location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The first source operand and 
destination operands are ZMM/YMM/XMM registers. The destination is conditionally updated with writemask k1. 

Operation 

PSUBQ (with 64-Bit operands) 
DEST[63:0] ← DEST[63:0] - SRC[63:0]; 

PSUBQ (with 128-Bit operands) 
DEST[63:0] ← DEST[63:0] - SRC[63:0]; 
DEST[127:64] ← DEST[127:64] - SRC[127:64]; 

VPSUBQ (VEX.128 encoded version) 
DEST[63:0] ← SRC1[63:0]-SRC2[63:0] 
DEST[127:64] ← SRC1[127:64]-SRC2[127:64] 
DEST[VLMAX-1:128] ← 0 

VPSUBQ (VEX.256 encoded version) 
DEST[63:0] ← SRC1[63:0]-SRC2[63:0] 
DEST[127:64] ← SRC1[127:64]-SRC2[127:64] 
DEST[191:128] ← SRC1[191:128]-SRC2[191:128] 
DEST[255:192] ← SRC1[255:192]-SRC2[255:192] 
DEST[VLMAX-1:256] ← 0 

VPSUBQ (EVEX encoded versions) 
(KL, VL) = (2, 128), (4, 256), (8, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 64 
    IF k1[j] OR *no writemask* THEN 
        IF (EVEX.b = 1) AND (SRC2 *is memory*) 
            THEN DEST[i+63:i] ← SRC1[i+63:i] - SRC2[63:0] 
            ELSE DEST[i+63:i] ← SRC1[i+63:i] - SRC2[i+63:i] 
        FI; 
    ELSE 
        IF *merging-masking* ; merging-masking 
            THEN *DEST[i+63:i] remains unchanged* 
            ELSE *zeroing-masking* ; zeroing-masking 
                DEST[i+63:i] ← 0 
        FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 

Intel C/C++ Compiler Intrinsic Equivalents 

VPSUBQ __m512i _mm512_sub_epi64(__m512i a, __m512i b); 
VPSUBQ __m512i _mm512_mask_sub_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b); 
VPSUBQ __m512i _mm512_maskz_sub_epi64(__mmask8 k, __m512i a, __m512i b); 
VPSUBQ __m256i _mm256_mask_sub_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b); 
VPSUBQ __m256i _mm256_maskz_sub_epi64(__mmask8 k, __m256i a, __m256i b); 
VPSUBQ __m128i _mm_mask_sub_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b); 
VPSUBQ __m128i _mm_maskz_sub_epi64(__mmask8 k, __m128i a, __m128i b); 
PSUBQ: __m64 _mm_sub_si64(__m64 m1, __m64 m2) 
(V)PSUBQ: __m128i _mm_sub_epi64(__m128i m1, __m128i m2) 
VPSUBQ: __m256i _mm256_sub_epi64(__m256i m1, __m256i m2) 

Flags Affected 

None. 

Numeric Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded VPSUBQ, see Exceptions Type E4. 

PSUBSB/PSUBSW—Subtract Packed Signed Integers with Signed Saturation 

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F E8 /r¹ PSUBSB mm, mm/m64 | RM | V/V | MMX | Subtract signed packed bytes in mm/m64 from signed packed bytes in mm and saturate results.
66 0F E8 /r PSUBSB xmm1, xmm2/m128 | RM | V/V | SSE2 | Subtract packed signed byte integers in xmm2/m128 from packed signed byte integers in xmm1 and saturate results.
0F E9 /r¹ PSUBSW mm, mm/m64 | RM | V/V | MMX | Subtract signed packed words in mm/m64 from signed packed words in mm and saturate results.
66 0F E9 /r PSUBSW xmm1, xmm2/m128 | RM | V/V | SSE2 | Subtract packed signed word integers in xmm2/m128 from packed signed word integers in xmm1 and saturate results.
VEX.NDS.128.66.0F.WIG E8 /r VPSUBSB xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Subtract packed signed byte integers in xmm3/m128 from packed signed byte integers in xmm2 and saturate results.
VEX.NDS.128.66.0F.WIG E9 /r VPSUBSW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Subtract packed signed word integers in xmm3/m128 from packed signed word integers in xmm2 and saturate results.
VEX.NDS.256.66.0F.WIG E8 /r VPSUBSB ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Subtract packed signed byte integers in ymm3/m256 from packed signed byte integers in ymm2 and saturate results.
VEX.NDS.256.66.0F.WIG E9 /r VPSUBSW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Subtract packed signed word integers in ymm3/m256 from packed signed word integers in ymm2 and saturate results.
EVEX.NDS.128.66.0F.WIG E8 /r VPSUBSB xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Subtract packed signed byte integers in xmm3/m128 from packed signed byte integers in xmm2 and saturate results and store in xmm1 using writemask k1.
EVEX.NDS.256.66.0F.WIG E8 /r VPSUBSB ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Subtract packed signed byte integers in ymm3/m256 from packed signed byte integers in ymm2 and saturate results and store in ymm1 using writemask k1.
EVEX.NDS.512.66.0F.WIG E8 /r VPSUBSB zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Subtract packed signed byte integers in zmm3/m512 from packed signed byte integers in zmm2 and saturate results and store in zmm1 using writemask k1.
EVEX.NDS.128.66.0F.WIG E9 /r VPSUBSW xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Subtract packed signed word integers in xmm3/m128 from packed signed word integers in xmm2 and saturate results and store in xmm1 using writemask k1.
EVEX.NDS.256.66.0F.WIG E9 /r VPSUBSW ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Subtract packed signed word integers in ymm3/m256 from packed signed word integers in ymm2 and saturate results and store in ymm1 using writemask k1.
EVEX.NDS.512.66.0F.WIG E9 /r VPSUBSW zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Subtract packed signed word integers in zmm3/m512 from packed signed word integers in zmm2 and saturate results and store in zmm1 using writemask k1.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers" 
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. 


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Performs a SIMD subtract of the packed signed integers of the source operand (second operand) from the packed 
signed integers of the destination operand (first operand), and stores the packed integer results in the destination 
operand. See Figure 9-4 in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1, for an 
illustration of a SIMD operation. Overflow is handled with signed saturation, as described in the following 
paragraphs. 

The (V)PSUBSB instruction subtracts packed signed byte integers. When an individual byte result is beyond the 
range of a signed byte integer (that is, greater than 7FH or less than 80H), the saturated value of 7FH or 80H, 
respectively, is written to the destination operand. 

The (V)PSUBSW instruction subtracts packed signed word integers. When an individual word result is beyond the 
range of a signed word integer (that is, greater than 7FFFH or less than 8000H), the saturated value of 7FFFH or 
8000H, respectively, is written to the destination operand. 
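
The signed saturation rule in the two paragraphs above can be modeled in scalar C as follows. This is a sketch of the per-element arithmetic only; the helper names mirror the SaturateToSignedByte and SaturateToSignedWord functions used in the Operation pseudocode, but their C spellings and the lane-model wrappers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a wider intermediate result to the signed byte range 80H..7FH. */
static int8_t saturate_to_signed_byte(int32_t v)
{
    if (v > INT8_MAX) return INT8_MAX;   /* 7FH */
    if (v < INT8_MIN) return INT8_MIN;   /* 80H */
    return (int8_t)v;
}

/* Clamp a wider intermediate result to the signed word range 8000H..7FFFH. */
static int16_t saturate_to_signed_word(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX; /* 7FFFH */
    if (v < INT16_MIN) return INT16_MIN; /* 8000H */
    return (int16_t)v;
}

/* Hypothetical model of one PSUBSB lane: subtract in 32 bits, then clamp. */
static int8_t psubsb_lane_model(int8_t a, int8_t b)
{
    return saturate_to_signed_byte((int32_t)a - (int32_t)b);
}

/* Hypothetical model of one PSUBSW lane. */
static int16_t psubsw_lane_model(int16_t a, int16_t b)
{
    return saturate_to_signed_word((int32_t)a - (int32_t)b);
}
```

Computing the difference in a wider type before clamping avoids any wraparound in the intermediate value, which is what distinguishes these instructions from (V)PSUBB and (V)PSUBW.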

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15). 

Legacy SSE version 64-bit operand: The destination operand must be an MMX technology register and the source 
operand can be either an MMX technology register or a 64-bit memory location. 

128-bit Legacy SSE version: The second source operand is an XMM register or a 128-bit memory location. The first 
source operand and destination operands are XMM registers. Bits (VLMAX-1:128) of the corresponding YMM 
destination register remain unchanged. 

VEX.128 encoded version: The second source operand is an XMM register or a 128-bit memory location. The first 
source operand and destination operands are XMM registers. Bits (VLMAX-1:128) of the destination YMM register 
are zeroed. 

VEX.256 encoded versions: The second source operand is a YMM register or a 256-bit memory location. The first 
source operand and destination operands are YMM registers. Bits (MAX_VL-1:256) of the corresponding ZMM 
register are zeroed. 

EVEX encoded version: The second source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory 
location. The first source operand and destination operands are ZMM/YMM/XMM registers. The destination is 
conditionally updated with writemask k1. 

Operation 

PSUBSB (with 64-bit operands) 
DEST[7:0] ← SaturateToSignedByte (DEST[7:0] - SRC[7:0]); 
(* Repeat subtract operation for 2nd through 7th bytes *) 
DEST[63:56] ← SaturateToSignedByte (DEST[63:56] - SRC[63:56]); 

PSUBSW (with 64-bit operands) 
DEST[15:0] ← SaturateToSignedWord (DEST[15:0] - SRC[15:0]); 
(* Repeat subtract operation for 2nd and 3rd words *) 
DEST[63:48] ← SaturateToSignedWord (DEST[63:48] - SRC[63:48]); 

VPSUBSB (EVEX encoded versions) 
(KL, VL) = (16, 128), (32, 256), (64, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 8; 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+7:i] ← SaturateToSignedByte (SRC1[i+7:i] - SRC2[i+7:i]) 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+7:i] remains unchanged* 
                ELSE *zeroing-masking* ; zeroing-masking 
                    DEST[i+7:i] ← 0; 
            FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0 

VPSUBSW (EVEX encoded versions) 
(KL, VL) = (8, 128), (16, 256), (32, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 16 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+15:i] ← SaturateToSignedWord (SRC1[i+15:i] - SRC2[i+15:i]) 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+15:i] remains unchanged* 
                ELSE *zeroing-masking* ; zeroing-masking 
                    DEST[i+15:i] ← 0; 
            FI 
    FI; 
ENDFOR; 
DEST[MAX_VL-1:VL] ← 0; 

VPSUBSB (VEX.256 encoded version) 
DEST[7:0] ← SaturateToSignedByte (SRC1[7:0] - SRC2[7:0]); 
(* Repeat subtract operation for 2nd through 31st bytes *) 
DEST[255:248] ← SaturateToSignedByte (SRC1[255:248] - SRC2[255:248]); 
DEST[MAX_VL-1:256] ← 0; 

VPSUBSB (VEX.128 encoded version) 
DEST[7:0] ← SaturateToSignedByte (SRC1[7:0] - SRC2[7:0]); 
(* Repeat subtract operation for 2nd through 15th bytes *) 
DEST[127:120] ← SaturateToSignedByte (SRC1[127:120] - SRC2[127:120]); 
DEST[MAX_VL-1:128] ← 0; 

PSUBSB (128-bit Legacy SSE Version) 
DEST[7:0] ← SaturateToSignedByte (DEST[7:0] - SRC[7:0]); 
(* Repeat subtract operation for 2nd through 15th bytes *) 
DEST[127:120] ← SaturateToSignedByte (DEST[127:120] - SRC[127:120]); 
DEST[MAX_VL-1:128] (Unmodified); 

VPSUBSW (VEX.256 encoded version) 
DEST[15:0] ← SaturateToSignedWord (SRC1[15:0] - SRC2[15:0]); 
(* Repeat subtract operation for 2nd through 15th words *) 
DEST[255:240] ← SaturateToSignedWord (SRC1[255:240] - SRC2[255:240]); 
DEST[MAX_VL-1:256] ← 0; 

VPSUBSW (VEX.128 encoded version) 
DEST[15:0] ← SaturateToSignedWord (SRC1[15:0] - SRC2[15:0]); 
(* Repeat subtract operation for 2nd through 7th words *) 
DEST[127:112] ← SaturateToSignedWord (SRC1[127:112] - SRC2[127:112]); 
DEST[MAX_VL-1:128] ← 0; 

PSUBSW (128-bit Legacy SSE Version) 
DEST[15:0] ← SaturateToSignedWord (DEST[15:0] - SRC[15:0]); 
(* Repeat subtract operation for 2nd through 7th words *) 
DEST[127:112] ← SaturateToSignedWord (DEST[127:112] - SRC[127:112]); 
DEST[MAX_VL-1:128] (Unmodified); 

Intel C/C++ Compiler Intrinsic Equivalents 

VPSUBSB __m512i _mm512_subs_epi8(__m512i a, __m512i b); 
VPSUBSB __m512i _mm512_mask_subs_epi8(__m512i s, __mmask64 k, __m512i a, __m512i b); 
VPSUBSB __m512i _mm512_maskz_subs_epi8(__mmask64 k, __m512i a, __m512i b); 
VPSUBSB __m256i _mm256_mask_subs_epi8(__m256i s, __mmask32 k, __m256i a, __m256i b); 
VPSUBSB __m256i _mm256_maskz_subs_epi8(__mmask32 k, __m256i a, __m256i b); 
VPSUBSB __m128i _mm_mask_subs_epi8(__m128i s, __mmask16 k, __m128i a, __m128i b); 
VPSUBSB __m128i _mm_maskz_subs_epi8(__mmask16 k, __m128i a, __m128i b); 
VPSUBSW __m512i _mm512_subs_epi16(__m512i a, __m512i b); 
VPSUBSW __m512i _mm512_mask_subs_epi16(__m512i s, __mmask32 k, __m512i a, __m512i b); 
VPSUBSW __m512i _mm512_maskz_subs_epi16(__mmask32 k, __m512i a, __m512i b); 
VPSUBSW __m256i _mm256_mask_subs_epi16(__m256i s, __mmask16 k, __m256i a, __m256i b); 
VPSUBSW __m256i _mm256_maskz_subs_epi16(__mmask16 k, __m256i a, __m256i b); 
VPSUBSW __m128i _mm_mask_subs_epi16(__m128i s, __mmask8 k, __m128i a, __m128i b); 
VPSUBSW __m128i _mm_maskz_subs_epi16(__mmask8 k, __m128i a, __m128i b); 
PSUBSB: __m64 _mm_subs_pi8(__m64 m1, __m64 m2) 
(V)PSUBSB: __m128i _mm_subs_epi8(__m128i m1, __m128i m2) 
VPSUBSB: __m256i _mm256_subs_epi8(__m256i m1, __m256i m2) 
PSUBSW: __m64 _mm_subs_pi16(__m64 m1, __m64 m2) 
(V)PSUBSW: __m128i _mm_subs_epi16(__m128i m1, __m128i m2) 
VPSUBSW: __m256i _mm256_subs_epi16(__m256i m1, __m256i m2) 

Flags Affected 

None. 

Numeric Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded instruction, see Exceptions Type E4.nb. 
PSUBUSB/PSUBUSW - Subtract Packed Unsigned Integers with Unsigned Saturation


Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F D8 /r¹ | PSUBUSB mm, mm/m64 | RM | V/V | MMX | Subtract unsigned packed bytes in mm/m64 from unsigned packed bytes in mm and saturate result.
66 0F D8 /r | PSUBUSB xmm1, xmm2/m128 | RM | V/V | SSE2 | Subtract packed unsigned byte integers in xmm2/m128 from packed unsigned byte integers in xmm1 and saturate result.
0F D9 /r¹ | PSUBUSW mm, mm/m64 | RM | V/V | MMX | Subtract unsigned packed words in mm/m64 from unsigned packed words in mm and saturate result.
66 0F D9 /r | PSUBUSW xmm1, xmm2/m128 | RM | V/V | SSE2 | Subtract packed unsigned word integers in xmm2/m128 from packed unsigned word integers in xmm1 and saturate result.
VEX.NDS.128.66.0F.WIG D8 /r | VPSUBUSB xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Subtract packed unsigned byte integers in xmm3/m128 from packed unsigned byte integers in xmm2 and saturate result.
VEX.NDS.128.66.0F.WIG D9 /r | VPSUBUSW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Subtract packed unsigned word integers in xmm3/m128 from packed unsigned word integers in xmm2 and saturate result.
VEX.NDS.256.66.0F.WIG D8 /r | VPSUBUSB ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Subtract packed unsigned byte integers in ymm3/m256 from packed unsigned byte integers in ymm2 and saturate result.
VEX.NDS.256.66.0F.WIG D9 /r | VPSUBUSW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Subtract packed unsigned word integers in ymm3/m256 from packed unsigned word integers in ymm2 and saturate result.
EVEX.NDS.128.66.0F.WIG D8 /r | VPSUBUSB xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Subtract packed unsigned byte integers in xmm3/m128 from packed unsigned byte integers in xmm2, saturate results and store in xmm1 using writemask k1.
EVEX.NDS.256.66.0F.WIG D8 /r | VPSUBUSB ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Subtract packed unsigned byte integers in ymm3/m256 from packed unsigned byte integers in ymm2, saturate results and store in ymm1 using writemask k1.
EVEX.NDS.512.66.0F.WIG D8 /r | VPSUBUSB zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Subtract packed unsigned byte integers in zmm3/m512 from packed unsigned byte integers in zmm2, saturate results and store in zmm1 using writemask k1.
EVEX.NDS.128.66.0F.WIG D9 /r | VPSUBUSW xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Subtract packed unsigned word integers in xmm3/m128 from packed unsigned word integers in xmm2, saturate results and store in xmm1 using writemask k1.
EVEX.NDS.256.66.0F.WIG D9 /r | VPSUBUSW ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Subtract packed unsigned word integers in ymm3/m256 from packed unsigned word integers in ymm2, saturate results and store in ymm1 using writemask k1.
EVEX.NDS.512.66.0F.WIG D9 /r | VPSUBUSW zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Subtract packed unsigned word integers in zmm3/m512 from packed unsigned word integers in zmm2, saturate results and store in zmm1 using writemask k1.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers"
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Performs a SIMD subtract of the packed unsigned integers of the source operand (second operand) from the 
packed unsigned integers of the destination operand (first operand), and stores the packed unsigned integer 
results in the destination operand. See Figure 9-4 in the Intel® 64 and IA-32 Architectures Software Developer's 
Manual, Volume 1, for an illustration of a SIMD operation. Overflow is handled with unsigned saturation, as 
described in the following paragraphs. 

These instructions can operate on either 64-bit or 128-bit operands. 

The (V)PSUBUSB instruction subtracts packed unsigned byte integers. When an individual byte result is less than
zero, the saturated value of 00H is written to the destination operand.

The (V)PSUBUSW instruction subtracts packed unsigned word integers. When an individual word result is less than
zero, the saturated value of 0000H is written to the destination operand.
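
As a rough illustration of the unsigned saturation described above, the following Python sketch models the operation on lists of element values. It is a scalar model under the assumption that each list entry is one packed element; the names `saturate_to_unsigned` and `psubus` are hypothetical and are not part of any Intel API.

```python
def saturate_to_unsigned(value: int, bits: int) -> int:
    """Clamp an integer to the unsigned range [0, 2**bits - 1]."""
    return max(0, min((1 << bits) - 1, value))


def psubus(dest: list, src: list, bits: int) -> list:
    """Element-wise dest - src with unsigned saturation at the given element width."""
    return [saturate_to_unsigned(d - s, bits) for d, s in zip(dest, src)]


# Byte elements: any negative difference becomes 00H.
bytes_result = psubus([200, 10, 0, 255], [100, 20, 1, 255], bits=8)

# Word elements: any negative difference becomes 0000H.
words_result = psubus([0x1234, 0x0001], [0x0234, 0x0002], bits=16)
```

Because both inputs are unsigned, the difference can never exceed the upper bound, so in practice only the clamp to zero takes effect.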

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15). 

Legacy SSE version 64-bit operand: The destination operand must be an MMX technology register and the source 
operand can be either an MMX technology register or a 64-bit memory location. 

128-bit Legacy SSE version: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (VLMAX-1:128) of the corresponding YMM
destination register remain unchanged.

VEX.128 encoded version: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (VLMAX-1:128) of the destination YMM register
are zeroed.

VEX.256 encoded versions: The second source operand is a YMM register or a 256-bit memory location. The first
source operand and destination operands are YMM registers. Bits (MAX_VL-1:256) of the corresponding ZMM
register are zeroed.

EVEX encoded version: The second source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory
location. The first source operand and destination operands are ZMM/YMM/XMM registers. The destination is
conditionally updated with writemask k1.

Operation 

PSUBUSB (with 64-bit operands)

DEST[7:0] <- SaturateToUnsignedByte (DEST[7:0] - SRC[7:0]);
(* Repeat subtract operation for 2nd through 7th bytes *)
DEST[63:56] <- SaturateToUnsignedByte (DEST[63:56] - SRC[63:56]);

PSUBUSW (with 64-bit operands)

DEST[15:0] <- SaturateToUnsignedWord (DEST[15:0] - SRC[15:0]);
(* Repeat subtract operation for 2nd and 3rd words *)
DEST[63:48] <- SaturateToUnsignedWord (DEST[63:48] - SRC[63:48]);

VPSUBUSB (EVEX encoded versions)

(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j <- 0 TO KL-1
    i <- j * 8;
    IF k1[j] OR *no writemask*
        THEN DEST[i+7:i] <- SaturateToUnsignedByte (SRC1[i+7:i] - SRC2[i+7:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+7:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+7:i] <- 0;
            FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] <- 0;

VPSUBUSW (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j <- 0 TO KL-1
    i <- j * 16;
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] <- SaturateToUnsignedWord (SRC1[i+15:i] - SRC2[i+15:i])
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+15:i] <- 0;
            FI
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] <- 0;
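
The writemask handling in the EVEX pseudocode can be sketched in Python as follows. This is an illustrative model only: `masked_subus` is a hypothetical helper, `k1` is assumed to be a plain integer whose bit j controls element j, and the `zeroing` argument stands in for the {z} qualifier.

```python
def masked_subus(dest, src1, src2, k1, bits, zeroing):
    """Model of a masked unsigned saturating subtract over lists of elements."""
    top = (1 << bits) - 1
    out = []
    for j, (old, a, b) in enumerate(zip(dest, src1, src2)):
        if (k1 >> j) & 1:
            # Mask bit set: compute the saturated difference.
            out.append(max(0, min(top, a - b)))
        elif zeroing:
            # Zeroing-masking: the element is cleared.
            out.append(0)
        else:
            # Merging-masking: the old destination element is kept.
            out.append(old)
    return out


old = [9, 9, 9, 9]
a = [5, 50, 7, 70]
b = [10, 20, 1, 80]
merged = masked_subus(old, a, b, k1=0b0110, bits=8, zeroing=False)
zeroed = masked_subus(old, a, b, k1=0b0110, bits=8, zeroing=True)
```

With mask 0110b only elements 1 and 2 are computed; elements 0 and 3 either keep the old destination value (merging) or become 0 (zeroing).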

VPSUBUSB (VEX.256 encoded version)

DEST[7:0] <- SaturateToUnsignedByte (SRC1[7:0] - SRC2[7:0]);
(* Repeat subtract operation for 2nd through 31st bytes *)
DEST[255:248] <- SaturateToUnsignedByte (SRC1[255:248] - SRC2[255:248]);
DEST[MAX_VL-1:256] <- 0;

VPSUBUSB (VEX.128 encoded version)

DEST[7:0] <- SaturateToUnsignedByte (SRC1[7:0] - SRC2[7:0]);
(* Repeat subtract operation for 2nd through 15th bytes *)
DEST[127:120] <- SaturateToUnsignedByte (SRC1[127:120] - SRC2[127:120]);
DEST[MAX_VL-1:128] <- 0

PSUBUSB (128-bit Legacy SSE Version)

DEST[7:0] <- SaturateToUnsignedByte (DEST[7:0] - SRC[7:0]);
(* Repeat subtract operation for 2nd through 15th bytes *)
DEST[127:120] <- SaturateToUnsignedByte (DEST[127:120] - SRC[127:120]);
DEST[MAX_VL-1:128] (Unmodified)

VPSUBUSW (VEX.256 encoded version)

DEST[15:0] <- SaturateToUnsignedWord (SRC1[15:0] - SRC2[15:0]);
(* Repeat subtract operation for 2nd through 15th words *)
DEST[255:240] <- SaturateToUnsignedWord (SRC1[255:240] - SRC2[255:240]);
DEST[MAX_VL-1:256] <- 0;

VPSUBUSW (VEX.128 encoded version)

DEST[15:0] <- SaturateToUnsignedWord (SRC1[15:0] - SRC2[15:0]);
(* Repeat subtract operation for 2nd through 7th words *)
DEST[127:112] <- SaturateToUnsignedWord (SRC1[127:112] - SRC2[127:112]);
DEST[MAX_VL-1:128] <- 0

PSUBUSW (128-bit Legacy SSE Version)

DEST[15:0] <- SaturateToUnsignedWord (DEST[15:0] - SRC[15:0]);
(* Repeat subtract operation for 2nd through 7th words *)
DEST[127:112] <- SaturateToUnsignedWord (DEST[127:112] - SRC[127:112]);
DEST[MAX_VL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalents 

VPSUBUSB __m512i _mm512_subs_epu8(__m512i a, __m512i b);
VPSUBUSB __m512i _mm512_mask_subs_epu8(__m512i s, __mmask64 k, __m512i a, __m512i b);
VPSUBUSB __m512i _mm512_maskz_subs_epu8(__mmask64 k, __m512i a, __m512i b);
VPSUBUSB __m256i _mm256_mask_subs_epu8(__m256i s, __mmask32 k, __m256i a, __m256i b);
VPSUBUSB __m256i _mm256_maskz_subs_epu8(__mmask32 k, __m256i a, __m256i b);
VPSUBUSB __m128i _mm_mask_subs_epu8(__m128i s, __mmask16 k, __m128i a, __m128i b);
VPSUBUSB __m128i _mm_maskz_subs_epu8(__mmask16 k, __m128i a, __m128i b);
VPSUBUSW __m512i _mm512_subs_epu16(__m512i a, __m512i b);
VPSUBUSW __m512i _mm512_mask_subs_epu16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPSUBUSW __m512i _mm512_maskz_subs_epu16(__mmask32 k, __m512i a, __m512i b);
VPSUBUSW __m256i _mm256_mask_subs_epu16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPSUBUSW __m256i _mm256_maskz_subs_epu16(__mmask16 k, __m256i a, __m256i b);
VPSUBUSW __m128i _mm_mask_subs_epu16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPSUBUSW __m128i _mm_maskz_subs_epu16(__mmask8 k, __m128i a, __m128i b);
PSUBUSB: __m64 _mm_subs_pu8(__m64 m1, __m64 m2)
(V)PSUBUSB: __m128i _mm_subs_epu8(__m128i m1, __m128i m2)
VPSUBUSB: __m256i _mm256_subs_epu8(__m256i m1, __m256i m2)
PSUBUSW: __m64 _mm_subs_pu16(__m64 m1, __m64 m2)
(V)PSUBUSW: __m128i _mm_subs_epu16(__m128i m1, __m128i m2)
VPSUBUSW: __m256i _mm256_subs_epu16(__m256i m1, __m256i m2)

Flags Affected 

None. 

Numeric Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded instruction, see Exceptions Type E4. 
PTEST - Logical Compare


Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 38 17 /r | PTEST xmm1, xmm2/m128 | RM | V/V | SSE4_1 | Set ZF if xmm2/m128 AND xmm1 result is all 0s. Set CF if xmm2/m128 AND NOT xmm1 result is all 0s.
VEX.128.66.0F38.WIG 17 /r | VPTEST xmm1, xmm2/m128 | RM | V/V | AVX | Set ZF and CF depending on bitwise AND and ANDN of sources.
VEX.256.66.0F38.WIG 17 /r | VPTEST ymm1, ymm2/m256 | RM | V/V | AVX | Set ZF and CF depending on bitwise AND and ANDN of sources.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r) | ModRM:r/m (r) | NA | NA


Description 

PTEST and VPTEST set the ZF flag if the bitwise AND of the first source operand (first operand) and the second
source operand (second operand) is all 0s. They set the CF flag if the bitwise AND of the second source operand
(second operand) and the logical NOT of the destination operand (first operand) is all 0s.

The first source register is specified by the ModR/M reg field.

128-bit versions: The first source register is an XMM register. The second source register can be an XMM register
or a 128-bit memory location. The destination register is not modified.

VEX.256 encoded version: The first source register is a YMM register. The second source register can be a YMM
register or a 256-bit memory location. The destination register is not modified.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD. 

Operation 

(V)PTEST (128-bit version)

IF (SRC[127:0] BITWISE AND DEST[127:0] = 0)
    THEN ZF <- 1;
    ELSE ZF <- 0;
IF (SRC[127:0] BITWISE AND NOT DEST[127:0] = 0)
    THEN CF <- 1;
    ELSE CF <- 0;
DEST (unmodified)
AF <- OF <- PF <- SF <- 0;

VPTEST (VEX.256 encoded version)

IF (SRC[255:0] BITWISE AND DEST[255:0] = 0) THEN ZF <- 1;
    ELSE ZF <- 0;
IF (SRC[255:0] BITWISE AND NOT DEST[255:0] = 0) THEN CF <- 1;
    ELSE CF <- 0;
DEST (unmodified)
AF <- OF <- PF <- SF <- 0;
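
The flag computation above can be modeled with Python integers standing in for the vector registers. This is a minimal sketch under that assumption; `ptest_flags` is a hypothetical name, and the `width` parameter (128 or 256) is introduced here only to select the operand size.

```python
def ptest_flags(dest: int, src: int, width: int = 128) -> dict:
    """Return the flag values that (V)PTEST would produce in this model."""
    mask = (1 << width) - 1
    dest &= mask
    src &= mask
    zf = int((src & dest) == 0)          # SRC AND DEST is all 0s
    cf = int((src & ~dest & mask) == 0)  # SRC AND NOT DEST is all 0s
    return {"ZF": zf, "CF": cf, "AF": 0, "OF": 0, "PF": 0, "SF": 0}


# No bit of src is set in dest: ZF = 1, CF = 0.
disjoint = ptest_flags(dest=0x0F, src=0xF0)

# Every bit of src is also set in dest: ZF = 0, CF = 1.
subset = ptest_flags(dest=0xFF, src=0x0F)
```

In this reading, ZF answers "do the operands share no set bits?" and CF answers "is every set bit of the second operand also set in the first?".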
Intel C/C++ Compiler Intrinsic Equivalent

PTEST

int _mm_testz_si128 (__m128i s1, __m128i s2);
int _mm_testc_si128 (__m128i s1, __m128i s2);
int _mm_testnzc_si128 (__m128i s1, __m128i s2);

VPTEST

int _mm256_testz_si256 (__m256i s1, __m256i s2);
int _mm256_testc_si256 (__m256i s1, __m256i s2);
int _mm256_testnzc_si256 (__m256i s1, __m256i s2);
int _mm_testz_si128 (__m128i s1, __m128i s2);
int _mm_testc_si128 (__m128i s1, __m128i s2);
int _mm_testnzc_si128 (__m128i s1, __m128i s2);

Flags Affected 

The OF, AF, PF, SF flags are cleared and the ZF, CF flags are set according to the operation. 

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

See Exceptions Type 4; additionally
#UD If VEX.vvvv ≠ 1111B.
PTWRITE - Write Data to a Processor Trace Packet


Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 REX.W 0F AE /4 | PTWRITE r64/m64 | RM | V/N.E | | Reads the data from r64/m64 to encode into a PTW packet if dependencies are met (see details below).
F3 0F AE /4 | PTWRITE r32/m32 | RM | V/V | | Reads the data from r32/m32 to encode into a PTW packet if dependencies are met (see details below).


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:rm (r) | NA | NA | NA


Description 

This instruction reads data in the source operand and sends it to the Intel Processor Trace hardware to be encoded
in a PTW packet if TriggerEn, ContextEn, FilterEn, and PTWEn are all set to 1. For more details on these values, see
Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3C, Section 36.2.3, "Power Event
Tracing". The size of the data is 64 bits if using REX.W in 64-bit mode; otherwise 32 bits of data are copied from the
source operand.

Note: The instruction will #UD if prefix 66H is used.

Operation 

IF (IA32_RTIT_STATUS.TriggerEn & IA32_RTIT_STATUS.ContextEn & IA32_RTIT_STATUS.FilterEn & IA32_RTIT_CTL.PTWEn) = 1
    PTW.PayloadBytes <- Encoded payload size;
    PTW.IP <- IA32_RTIT_CTL.FUPonPTW
    IF IA32_RTIT_CTL.FUPonPTW = 1
        Insert FUP packet with IP of PTWRITE;
    FI;
FI;
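
The gating logic in the pseudocode can be sketched as a small Python model. Everything here is an illustrative assumption: `ptwrite` is a hypothetical function, the enable bits are passed as plain booleans rather than read from MSRs, packets are represented as dictionaries, the payload size is recorded in bytes rather than in its packet encoding, and the relative order of the PTW and FUP packets is chosen only for the example.

```python
def ptwrite(value, rex_w, trigger_en, context_en, filter_en, ptw_en,
            fup_on_ptw, ip):
    """Return the list of trace packets a PTWRITE would produce in this model."""
    packets = []
    if trigger_en and context_en and filter_en and ptw_en:
        size = 8 if rex_w else 4  # payload size in bytes (64-bit only with REX.W)
        payload = value & ((1 << (8 * size)) - 1)
        packets.append({"type": "PTW", "payload_bytes": size,
                        "payload": payload, "ip": int(fup_on_ptw)})
        if fup_on_ptw:
            # A FUP packet carries the address of the PTWRITE instruction.
            packets.append({"type": "FUP", "ip": ip})
    return packets


enabled = ptwrite(0x1_2345_6789, rex_w=False, trigger_en=True, context_en=True,
                  filter_en=True, ptw_en=True, fup_on_ptw=True, ip=0x401000)
disabled = ptwrite(0x1_2345_6789, rex_w=False, trigger_en=True, context_en=False,
                   filter_en=True, ptw_en=True, fup_on_ptw=True, ip=0x401000)
```

When any of the four enables is clear, the model emits nothing, which mirrors the instruction completing without generating a packet.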

Flags Affected 

None. 

Other Exceptions 

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS or GS segments.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF (fault-code) For a page fault.

#AC(0) If an unaligned memory reference is made while the current privilege level is 3 and alignment
checking is enabled.

#UD If CPUID.(EAX=14H, ECX=0):EBX.PTWRITE [Bit 4] = 0.

If LOCK prefix is used.

If 66H prefix is used.
Real-Address Mode Exceptions

#GP(0) If any part of the operand lies outside of the effective address space from 0 to 0FFFFH.

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#UD If CPUID.(EAX=14H, ECX=0):EBX.PTWRITE [Bit 4] = 0. 

If LOCK prefix is used. 

If 66H prefix is used. 

Virtual 8086 Mode Exceptions 

#GP(0) If any part of the operand lies outside of the effective address space from 0 to 0FFFFH.

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF (fault-code) For a page fault. 

#AC(0) If an unaligned memory reference is made while alignment checking is enabled. 

#UD If CPUID.(EAX=14H, ECX=0):EBX.PTWRITE [Bit 4] = 0. 

If LOCK prefix is used. 

If 66H prefix is used. 

Compatibility Mode Exceptions 

Same exceptions as in Protected Mode. 

64-Bit Mode Exceptions 

#GP(0) If the memory address is in a non-canonical form. 

#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 

#PF (fault-code) For a page fault. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
current privilege level is 3.

#UD If CPUID.(EAX=14H, ECX=0):EBX.PTWRITE [Bit 4] = 0. 

If LOCK prefix is used. 

If 66H prefix is used. 
PUNPCKHBW/PUNPCKHWD/PUNPCKHDQ/PUNPCKHQDQ - Unpack High Data


Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F 68 /r¹ | PUNPCKHBW mm, mm/m64 | RM | V/V | MMX | Unpack and interleave high-order bytes from mm and mm/m64 into mm.
66 0F 68 /r | PUNPCKHBW xmm1, xmm2/m128 | RM | V/V | SSE2 | Unpack and interleave high-order bytes from xmm1 and xmm2/m128 into xmm1.
0F 69 /r¹ | PUNPCKHWD mm, mm/m64 | RM | V/V | MMX | Unpack and interleave high-order words from mm and mm/m64 into mm.
66 0F 69 /r | PUNPCKHWD xmm1, xmm2/m128 | RM | V/V | SSE2 | Unpack and interleave high-order words from xmm1 and xmm2/m128 into xmm1.
0F 6A /r¹ | PUNPCKHDQ mm, mm/m64 | RM | V/V | MMX | Unpack and interleave high-order doublewords from mm and mm/m64 into mm.
66 0F 6A /r | PUNPCKHDQ xmm1, xmm2/m128 | RM | V/V | SSE2 | Unpack and interleave high-order doublewords from xmm1 and xmm2/m128 into xmm1.
66 0F 6D /r | PUNPCKHQDQ xmm1, xmm2/m128 | RM | V/V | SSE2 | Unpack and interleave high-order quadwords from xmm1 and xmm2/m128 into xmm1.
VEX.NDS.128.66.0F.WIG 68 /r | VPUNPCKHBW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Interleave high-order bytes from xmm2 and xmm3/m128 into xmm1.
VEX.NDS.128.66.0F.WIG 69 /r | VPUNPCKHWD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Interleave high-order words from xmm2 and xmm3/m128 into xmm1.
VEX.NDS.128.66.0F.WIG 6A /r | VPUNPCKHDQ xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Interleave high-order doublewords from xmm2 and xmm3/m128 into xmm1.
VEX.NDS.128.66.0F.WIG 6D /r | VPUNPCKHQDQ xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Interleave high-order quadword from xmm2 and xmm3/m128 into xmm1 register.
VEX.NDS.256.66.0F.WIG 68 /r | VPUNPCKHBW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Interleave high-order bytes from ymm2 and ymm3/m256 into ymm1 register.
VEX.NDS.256.66.0F.WIG 69 /r | VPUNPCKHWD ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Interleave high-order words from ymm2 and ymm3/m256 into ymm1 register.
VEX.NDS.256.66.0F.WIG 6A /r | VPUNPCKHDQ ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Interleave high-order doublewords from ymm2 and ymm3/m256 into ymm1 register.
VEX.NDS.256.66.0F.WIG 6D /r | VPUNPCKHQDQ ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Interleave high-order quadword from ymm2 and ymm3/m256 into ymm1 register.
EVEX.NDS.128.66.0F.WIG 68 /r | VPUNPCKHBW xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Interleave high-order bytes from xmm2 and xmm3/m128 into xmm1 register using k1 write mask.
EVEX.NDS.128.66.0F.WIG 69 /r | VPUNPCKHWD xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Interleave high-order words from xmm2 and xmm3/m128 into xmm1 register using k1 write mask.
EVEX.NDS.128.66.0F.W0 6A /r | VPUNPCKHDQ xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Interleave high-order doublewords from xmm2 and xmm3/m128/m32bcst into xmm1 register using k1 write mask.
EVEX.NDS.128.66.0F.W1 6D /r | VPUNPCKHQDQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Interleave high-order quadword from xmm2 and xmm3/m128/m64bcst into xmm1 register using k1 write mask.
EVEX.NDS.256.66.0F.WIG 68 /r | VPUNPCKHBW ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Interleave high-order bytes from ymm2 and ymm3/m256 into ymm1 register using k1 write mask.
EVEX.NDS.256.66.0F.WIG 69 /r | VPUNPCKHWD ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Interleave high-order words from ymm2 and ymm3/m256 into ymm1 register using k1 write mask.
EVEX.NDS.256.66.0F.W0 6A /r | VPUNPCKHDQ ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Interleave high-order doublewords from ymm2 and ymm3/m256/m32bcst into ymm1 register using k1 write mask.
EVEX.NDS.256.66.0F.W1 6D /r | VPUNPCKHQDQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Interleave high-order quadword from ymm2 and ymm3/m256/m64bcst into ymm1 register using k1 write mask.
EVEX.NDS.512.66.0F.WIG 68 /r | VPUNPCKHBW zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Interleave high-order bytes from zmm2 and zmm3/m512 into zmm1 register.
EVEX.NDS.512.66.0F.WIG 69 /r | VPUNPCKHWD zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Interleave high-order words from zmm2 and zmm3/m512 into zmm1 register.
EVEX.NDS.512.66.0F.W0 6A /r | VPUNPCKHDQ zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | FV | V/V | AVX512F | Interleave high-order doublewords from zmm2 and zmm3/m512/m32bcst into zmm1 register using k1 write mask.
EVEX.NDS.512.66.0F.W1 6D /r | VPUNPCKHQDQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | FV | V/V | AVX512F | Interleave high-order quadword from zmm2 and zmm3/m512/m64bcst into zmm1 register using k1 write mask.

NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers"
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Unpacks and interleaves the high-order data elements (bytes, words, doublewords, or quadwords) of the destination
operand (first operand) and source operand (second operand) into the destination operand. Figure 4-20 shows
the unpack operation for bytes in 64-bit operands. The low-order data elements are ignored.
Figure 4-20. PUNPCKHBW Instruction Operation Using 64-bit Operands

Figure 4-21. 256-bit VPUNPCKHDQ Instruction Operation

When the source data comes from a 64-bit memory operand, the full 64-bit operand is accessed from memory, but 
the instruction uses only the high-order 32 bits. When the source data comes from a 128-bit memory operand, an 
implementation may fetch only the appropriate 64 bits; however, alignment to a 16-byte boundary and normal 
segment checking will still be enforced. 

The (V)PUNPCKHBW instruction interleaves the high-order bytes of the source and destination operands, the
(V)PUNPCKHWD instruction interleaves the high-order words of the source and destination operands, the
(V)PUNPCKHDQ instruction interleaves the high-order doubleword (or doublewords) of the source and destination
operands, and the (V)PUNPCKHQDQ instruction interleaves the high-order quadwords of the source and destination
operands.

These instructions can be used to convert bytes to words, words to doublewords, doublewords to quadwords, and
quadwords to double quadwords, respectively, by placing all 0s in the source operand. Here, if the source operand
contains all 0s, the result (stored in the destination operand) contains zero extensions of the high-order data
elements from the original value in the destination operand. For example, with the (V)PUNPCKHBW instruction the
high-order bytes are zero extended (that is, unpacked into unsigned word integers), and with the (V)PUNPCKHWD
instruction, the high-order words are zero extended (unpacked into unsigned doubleword integers).
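
The zero-extension use described above can be illustrated with a small Python model of PUNPCKHBW on 64-bit operands. This is a sketch under the assumption that each operand is an 8-byte `bytes` object with byte 0 as the least significant byte; `punpckhbw_64` is a hypothetical helper name.

```python
def punpckhbw_64(dest: bytes, src: bytes) -> bytes:
    """Interleave the four high-order bytes of two 8-byte operands."""
    out = bytearray()
    for d, s in zip(dest[4:], src[4:]):
        out += bytes([d, s])  # destination byte first, then source byte
    return bytes(out)


dest = bytes([0x10, 0x11, 0x12, 0x13, 0xA4, 0xB5, 0xC6, 0xD7])
zero = bytes(8)  # an all-0s source operand

unpacked = punpckhbw_64(dest, zero)

# Reading the result as little-endian words shows the zero extension.
words = [int.from_bytes(unpacked[i:i + 2], "little") for i in range(0, 8, 2)]
```

Each high-order byte of `dest` ends up as the low byte of a word whose high byte comes from the all-0s source, so the words hold the original byte values unchanged.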

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15). 

Legacy SSE versions 64-bit operand: The source operand can be an MMX technology register or a 64-bit memory 
location. The destination operand is an MMX technology register. 

128-bit Legacy SSE versions: The second source operand is an XMM register or a 128-bit memory location. The
first source operand and destination operands are XMM registers. Bits (VLMAX-1:128) of the corresponding YMM
destination register remain unchanged.

VEX.128 encoded versions: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (VLMAX-1:128) of the destination YMM register
are zeroed.

VEX.256 encoded version: The second source operand is a YMM register or a 256-bit memory location. The first
source operand and destination operands are YMM registers.

EVEX encoded VPUNPCKHDQ/QDQ: The second source operand is a ZMM/YMM/XMM register, a 512/256/128-bit
memory location or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. The first source
operand and destination operands are ZMM/YMM/XMM registers. The destination is conditionally updated with
writemask k1.

EVEX encoded VPUNPCKHWD/BW: The second source operand is a ZMM/YMM/XMM register or a 512/256/128-bit
memory location. The first source operand and destination operands are ZMM/YMM/XMM registers. The destination
is conditionally updated with writemask k1.

Operation 

PUNPCKHBW instruction with 64-bit operands:

DEST[7:0] <- DEST[39:32];
DEST[15:8] <- SRC[39:32];
DEST[23:16] <- DEST[47:40];
DEST[31:24] <- SRC[47:40];
DEST[39:32] <- DEST[55:48];
DEST[47:40] <- SRC[55:48];
DEST[55:48] <- DEST[63:56];
DEST[63:56] <- SRC[63:56];

PUNPCKHWD instruction with 64-bit operands:

DEST[15:0] <- DEST[47:32];
DEST[31:16] <- SRC[47:32];
DEST[47:32] <- DEST[63:48];
DEST[63:48] <- SRC[63:48];

PUNPCKHDQ instruction with 64-bit operands:

DEST[31:0] <- DEST[63:32];
DEST[63:32] <- SRC[63:32];

INTERLEAVE_HIGH_BYTES_512b (SRC1, SRC2)

TMP_DEST[255:0] <- INTERLEAVE_HIGH_BYTES_256b(SRC1[255:0], SRC2[255:0])
TMP_DEST[511:256] <- INTERLEAVE_HIGH_BYTES_256b(SRC1[511:256], SRC2[511:256])

INTERLEAVE_HIGH_BYTES_256b (SRC1, SRC2)

DEST[7:0] <- SRC1[71:64]
DEST[15:8] <- SRC2[71:64]
DEST[23:16] <- SRC1[79:72]
DEST[31:24] <- SRC2[79:72]
DEST[39:32] <- SRC1[87:80]
DEST[47:40] <- SRC2[87:80]
DEST[55:48] <- SRC1[95:88]
DEST[63:56] <- SRC2[95:88]
DEST[71:64] <- SRC1[103:96]
DEST[79:72] <- SRC2[103:96]
DEST[87:80] <- SRC1[111:104]
DEST[95:88] <- SRC2[111:104]
DEST[103:96] <- SRC1[119:112]
DEST[111:104] <- SRC2[119:112]
DEST[119:112] <- SRC1[127:120]
DEST[127:120] <- SRC2[127:120]
DEST[135:128] <- SRC1[199:192]
DEST[143:136] <- SRC2[199:192]
DEST[151:144] <- SRC1[207:200]
DEST[159:152] <- SRC2[207:200]
DEST[167:160] <- SRC1[215:208]
DEST[175:168] <- SRC2[215:208]
DEST[183:176] <- SRC1[223:216]
DEST[191:184] <- SRC2[223:216]
DEST[199:192] <- SRC1[231:224]
DEST[207:200] <- SRC2[231:224]
DEST[215:208] <- SRC1[239:232]
DEST[223:216] <- SRC2[239:232]
DEST[231:224] <- SRC1[247:240]
DEST[239:232] <- SRC2[247:240]
DEST[247:240] <- SRC1[255:248]
DEST[255:248] <- SRC2[255:248]

INTERLEAVE_HIGH_BYTES (SRC1, SRC2)

DEST[7:0] <- SRC1[71:64]
DEST[15:8] <- SRC2[71:64]
DEST[23:16] <- SRC1[79:72]
DEST[31:24] <- SRC2[79:72]
DEST[39:32] <- SRC1[87:80]
DEST[47:40] <- SRC2[87:80]
DEST[55:48] <- SRC1[95:88]
DEST[63:56] <- SRC2[95:88]
DEST[71:64] <- SRC1[103:96]
DEST[79:72] <- SRC2[103:96]
DEST[87:80] <- SRC1[111:104]
DEST[95:88] <- SRC2[111:104]
DEST[103:96] <- SRC1[119:112]
DEST[111:104] <- SRC2[119:112]
DEST[119:112] <- SRC1[127:120]
DEST[127:120] <- SRC2[127:120]

INTERLEAVE_HIGH_WORDS_512b (SRC1, SRC2)

TMP_DEST[255:0] <- INTERLEAVE_HIGH_WORDS_256b(SRC1[255:0], SRC2[255:0])
TMP_DEST[511:256] <- INTERLEAVE_HIGH_WORDS_256b(SRC1[511:256], SRC2[511:256])

INTERLEAVE_HIGH_WORDS_256b(SRC1, SRC2)

DEST[15:0] <- SRC1[79:64]
DEST[31:16] <- SRC2[79:64]
DEST[47:32] <- SRC1[95:80]
DEST[63:48] <- SRC2[95:80]
DEST[79:64] <- SRC1[111:96]
DEST[95:80] <- SRC2[111:96]
DEST[111:96] <- SRC1[127:112]
DEST[127:112] <- SRC2[127:112]
DEST[143:128] <- SRC1[207:192]
DEST[159:144] <- SRC2[207:192]
DEST[175:160] <- SRC1[223:208]
DEST[191:176] <- SRC2[223:208]
DEST[207:192] <- SRC1[239:224]
DEST[223:208] <- SRC2[239:224]
DEST[239:224] <- SRC1[255:240]
DEST[255:240] <- SRC2[255:240]

INTERLEAVE_HIGH_WORDS (SRC1, SRC2)

DEST[15:0] <- SRC1[79:64]
DEST[31:16] <- SRC2[79:64]
DEST[47:32] <- SRC1[95:80]
DEST[63:48] <- SRC2[95:80]
DEST[79:64] <- SRC1[111:96]
DEST[95:80] <- SRC2[111:96]
DEST[111:96] <- SRC1[127:112]
DEST[127:112] <- SRC2[127:112]

INTERLEAVE_HIGH_DWORDS_512b (SRC1, SRC2)

TMP_DEST[255:0] <- INTERLEAVE_HIGH_DWORDS_256b(SRC1[255:0], SRC2[255:0])
TMP_DEST[511:256] <- INTERLEAVE_HIGH_DWORDS_256b(SRC1[511:256], SRC2[511:256])

INTERLEAVE_HIGH_DWORDS_256b(SRC1, SRC2)

DEST[31:0] <- SRC1[95:64]
DEST[63:32] <- SRC2[95:64]
DEST[95:64] <- SRC1[127:96]
DEST[127:96] <- SRC2[127:96]
DEST[159:128] <- SRC1[223:192]
DEST[191:160] <- SRC2[223:192]
DEST[223:192] <- SRC1[255:224]
DEST[255:224] <- SRC2[255:224]

INTERLEAVE_HIGH_DWORDS(SRC1, SRC2)

DEST[31:0] <- SRC1[95:64]
DEST[63:32] <- SRC2[95:64]
DEST[95:64] <- SRC1[127:96]
DEST[127:96] <- SRC2[127:96]

INTERLEAVE_HIGH_QWORDS_512b (SRC1, SRC2)

TMP_DEST[255:0] <- INTERLEAVE_HIGH_QWORDS_256b(SRC1[255:0], SRC2[255:0])
TMP_DEST[511:256] <- INTERLEAVE_HIGH_QWORDS_256b(SRC1[511:256], SRC2[511:256])

INTERLEAVE_HIGH_QWORDS_256b(SRC1, SRC2)

DEST[63:0] <- SRC1[127:64]
DEST[127:64] <- SRC2[127:64]
DEST[191:128] <- SRC1[255:192]
DEST[255:192] <- SRC2[255:192]

INTERLEAVE_HIGH_QWORDS(SRC1, SRC2)

DEST[63:0] <- SRC1[127:64]
DEST[127:64] <- SRC2[127:64]
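
All of the INTERLEAVE_HIGH_* helpers above follow one pattern: within each 128-bit lane, elements from the upper 64 bits of the two sources are interleaved. The Python sketch below expresses that pattern generically. It is an illustrative model only; `interleave_high` is a hypothetical name, operands are `bytes` objects with byte 0 least significant, and `elem_bytes` (1, 2, 4 or 8) selects byte, word, doubleword or quadword elements.

```python
def interleave_high(src1: bytes, src2: bytes, elem_bytes: int) -> bytes:
    """Interleave high-half elements of src1 and src2 within each 16-byte lane."""
    assert len(src1) == len(src2) and len(src1) % 16 == 0
    out = bytearray()
    for lane in range(0, len(src1), 16):
        # Only the upper 8 bytes of each 128-bit lane are consumed.
        for off in range(lane + 8, lane + 16, elem_bytes):
            out += src1[off:off + elem_bytes]
            out += src2[off:off + elem_bytes]
    return bytes(out)


# Two 256-bit operands with recognizable byte values.
src1 = bytes(range(32))
src2 = bytes(range(100, 132))

dwords = interleave_high(src1, src2, elem_bytes=4)
qwords = interleave_high(src1, src2, elem_bytes=8)
```

Note that the second lane of the result draws from bytes 24 to 31 of the sources, not from bytes 16 to 23, which matches the 256-bit pseudocode where DEST[135:128] is taken from SRC1[199:192].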


PUNPCKHBW (128-bit Legacy SSE Version)

DEST[127:0] <- INTERLEAVE_HIGH_BYTES(DEST, SRC)
DEST[255:128] (Unmodified)

VPUNPCKHBW (VEX.128 encoded version)

DEST[127:0] <- INTERLEAVE_HIGH_BYTES(SRC1, SRC2)
DEST[511:128] <- 0

VPUNPCKHBW (VEX.256 encoded version)

DEST[255:0] <- INTERLEAVE_HIGH_BYTES_256b(SRC1, SRC2)
DEST[MAX_VL-1:256] <- 0
VPUNPCKHBW (EVEX encoded versions)

(KL, VL) = (16, 128), (32, 256), (64, 512)
IF VL = 128
    TMP_DEST[VL-1:0] <- INTERLEAVE_HIGH_BYTES(SRC1[VL-1:0], SRC2[VL-1:0])
FI;
IF VL = 256
    TMP_DEST[VL-1:0] <- INTERLEAVE_HIGH_BYTES_256b(SRC1[VL-1:0], SRC2[VL-1:0])
FI;
IF VL = 512
    TMP_DEST[VL-1:0] <- INTERLEAVE_HIGH_BYTES_512b(SRC1[VL-1:0], SRC2[VL-1:0])
FI;
FOR j <- 0 TO KL-1
    i <- j * 8
    IF k1[j] OR *no writemask*
        THEN DEST[i+7:i] <- TMP_DEST[i+7:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+7:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+7:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0


PUNPCKHWD (128-bit Legacy SSE Version)

DEST[127:0] <- INTERLEAVE_HIGH_WORDS(DEST, SRC)
DEST[255:128] (Unmodified)

VPUNPCKHWD (VEX.128 encoded version)

DEST[127:0] <- INTERLEAVE_HIGH_WORDS(SRC1, SRC2)
DEST[511:128] <- 0

VPUNPCKHWD (VEX.256 encoded version)

DEST[255:0] <- INTERLEAVE_HIGH_WORDS_256b(SRC1, SRC2)
DEST[MAX_VL-1:256] <- 0

VPUNPCKHWD (EVEX encoded versions)

(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL = 128
    TMP_DEST[VL-1:0] <- INTERLEAVE_HIGH_WORDS(SRC1[VL-1:0], SRC2[VL-1:0])
FI;
IF VL = 256
    TMP_DEST[VL-1:0] <- INTERLEAVE_HIGH_WORDS_256b(SRC1[VL-1:0], SRC2[VL-1:0])
FI;
IF VL = 512
    TMP_DEST[VL-1:0] <- INTERLEAVE_HIGH_WORDS_512b(SRC1[VL-1:0], SRC2[VL-1:0])
FI;
FOR j <- 0 TO KL-1
    i <- j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] <- TMP_DEST[i+15:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+15:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0

PUNPCKHDQ (128-bit Legacy SSE Version)

DEST[127:0] <- INTERLEAVE_HIGH_DWORDS(DEST, SRC)
DEST[MAX_VL-1:128] (Unmodified)

VPUNPCKHDQ (VEX.128 encoded version)

DEST[127:0] <- INTERLEAVE_HIGH_DWORDS(SRC1, SRC2)
DEST[MAX_VL-1:128] <- 0

VPUNPCKHDQ (VEX.256 encoded version)

DEST[255:0] <- INTERLEAVE_HIGH_DWORDS_256b(SRC1, SRC2)
DEST[MAX_VL-1:256] <- 0

VPUNPCKHDQ (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j <- 0 TO KL-1
    i <- j * 32
    IF (EVEX.b = 1) AND (SRC2 *is memory*)
        THEN TMP_SRC2[i+31:i] <- SRC2[31:0]
        ELSE TMP_SRC2[i+31:i] <- SRC2[i+31:i]
    FI;
ENDFOR;

IF VL = 128
    TMP_DEST[VL-1:0] <- INTERLEAVE_HIGH_DWORDS(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;
IF VL = 256
    TMP_DEST[VL-1:0] <- INTERLEAVE_HIGH_DWORDS_256b(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;
IF VL = 512
    TMP_DEST[VL-1:0] <- INTERLEAVE_HIGH_DWORDS_512b(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;

FOR j <- 0 TO KL-1
    i <- j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] <- TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+31:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0

PUNPCKHQDQ (128-bit Legacy SSE Version)

DEST[127:0] <- INTERLEAVE_HIGH_QWORDS(DEST, SRC)
DEST[MAX_VL-1:128] (Unmodified)

VPUNPCKHQDQ (VEX.128 encoded version)

DEST[127:0] <- INTERLEAVE_HIGH_QWORDS(SRC1, SRC2)
DEST[MAX_VL-1:128] <- 0

VPUNPCKHQDQ (VEX.256 encoded version)

DEST[255:0] <- INTERLEAVE_HIGH_QWORDS_256b(SRC1, SRC2)
DEST[MAX_VL-1:256] <- 0

VPUNPCKHQDQ (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j <- 0 TO KL-1
    i <- j * 64
    IF (EVEX.b = 1) AND (SRC2 *is memory*)
        THEN TMP_SRC2[i+63:i] <- SRC2[63:0]
        ELSE TMP_SRC2[i+63:i] <- SRC2[i+63:i]
    FI;
ENDFOR;

IF VL = 128
    TMP_DEST[VL-1:0] <- INTERLEAVE_HIGH_QWORDS(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;
IF VL = 256
    TMP_DEST[VL-1:0] <- INTERLEAVE_HIGH_QWORDS_256b(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;
IF VL = 512
    TMP_DEST[VL-1:0] <- INTERLEAVE_HIGH_QWORDS_512b(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;

FOR j <- 0 TO KL-1
    i <- j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] <- TMP_DEST[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+63:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0
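
For the doubleword and quadword forms, the first loop builds TMP_SRC2 by replicating a single memory element when embedded broadcast (EVEX.b = 1) is in effect. A minimal Python sketch of that step followed by the high-quadword interleave is shown below. The function names and the list-based vector model (index 0 least significant) are assumptions made for illustration.

```python
def broadcast_src2(src2, kl, evex_b, src2_is_memory):
    """Model of the TMP_SRC2 loop: replicate element 0 when broadcasting."""
    if evex_b and src2_is_memory:
        return [src2[0]] * kl      # every element is a copy of SRC2[63:0]
    return list(src2[:kl])         # otherwise SRC2 is used unchanged


def interleave_high_qwords(src1, src2):
    """Per 128-bit lane (2 qwords): DEST = [SRC1 high qword, SRC2 high qword]."""
    dest = []
    for base in range(0, len(src1), 2):
        dest += [src1[base + 1], src2[base + 1]]
    return dest


# VL = 256 (KL = 4) with a single 64-bit memory element broadcast to all lanes.
src1 = [0x10, 0x11, 0x12, 0x13]
tmp_src2 = broadcast_src2([0xAA], kl=4, evex_b=True, src2_is_memory=True)
print([hex(x) for x in interleave_high_qwords(src1, tmp_src2)])
```

With broadcast, every odd result element holds the same memory value, so the instruction effectively pairs the high quadword of each SRC1 lane with one scalar.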

Intel C/C++ Compiler Intrinsic Equivalents

VPUNPCKHBW __m512i _mm512_unpackhi_epi8(__m512i a, __m512i b);
VPUNPCKHBW __m512i _mm512_mask_unpackhi_epi8(__m512i s, __mmask64 k, __m512i a, __m512i b);
VPUNPCKHBW __m512i _mm512_maskz_unpackhi_epi8(__mmask64 k, __m512i a, __m512i b);
VPUNPCKHBW __m256i _mm256_mask_unpackhi_epi8(__m256i s, __mmask32 k, __m256i a, __m256i b);
VPUNPCKHBW __m256i _mm256_maskz_unpackhi_epi8(__mmask32 k, __m256i a, __m256i b);
VPUNPCKHBW __m128i _mm_mask_unpackhi_epi8(__m128i s, __mmask16 k, __m128i a, __m128i b);
VPUNPCKHBW __m128i _mm_maskz_unpackhi_epi8(__mmask16 k, __m128i a, __m128i b);

VPUNPCKHWD __m512i _mm512_unpackhi_epi16(__m512i a, __m512i b);
VPUNPCKHWD __m512i _mm512_mask_unpackhi_epi16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPUNPCKHWD __m512i _mm512_maskz_unpackhi_epi16(__mmask32 k, __m512i a, __m512i b);
VPUNPCKHWD __m256i _mm256_mask_unpackhi_epi16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPUNPCKHWD __m256i _mm256_maskz_unpackhi_epi16(__mmask16 k, __m256i a, __m256i b);
VPUNPCKHWD __m128i _mm_mask_unpackhi_epi16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPUNPCKHWD __m128i _mm_maskz_unpackhi_epi16(__mmask8 k, __m128i a, __m128i b);

VPUNPCKHDQ __m512i _mm512_unpackhi_epi32(__m512i a, __m512i b);
VPUNPCKHDQ __m512i _mm512_mask_unpackhi_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b);
VPUNPCKHDQ __m512i _mm512_maskz_unpackhi_epi32(__mmask16 k, __m512i a, __m512i b);
VPUNPCKHDQ __m256i _mm256_mask_unpackhi_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPUNPCKHDQ __m256i _mm256_maskz_unpackhi_epi32(__mmask8 k, __m256i a, __m256i b);
VPUNPCKHDQ __m128i _mm_mask_unpackhi_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPUNPCKHDQ __m128i _mm_maskz_unpackhi_epi32(__mmask8 k, __m128i a, __m128i b);

VPUNPCKHQDQ __m512i _mm512_unpackhi_epi64(__m512i a, __m512i b);
VPUNPCKHQDQ __m512i _mm512_mask_unpackhi_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPUNPCKHQDQ __m512i _mm512_maskz_unpackhi_epi64(__mmask8 k, __m512i a, __m512i b);
VPUNPCKHQDQ __m256i _mm256_mask_unpackhi_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPUNPCKHQDQ __m256i _mm256_maskz_unpackhi_epi64(__mmask8 k, __m256i a, __m256i b);
VPUNPCKHQDQ __m128i _mm_mask_unpackhi_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPUNPCKHQDQ __m128i _mm_maskz_unpackhi_epi64(__mmask8 k, __m128i a, __m128i b);

PUNPCKHBW: __m64 _mm_unpackhi_pi8(__m64 m1, __m64 m2)
(V)PUNPCKHBW: __m128i _mm_unpackhi_epi8(__m128i m1, __m128i m2)
VPUNPCKHBW: __m256i _mm256_unpackhi_epi8(__m256i m1, __m256i m2)

PUNPCKHWD: __m64 _mm_unpackhi_pi16(__m64 m1, __m64 m2)
(V)PUNPCKHWD: __m128i _mm_unpackhi_epi16(__m128i m1, __m128i m2)
VPUNPCKHWD: __m256i _mm256_unpackhi_epi16(__m256i m1, __m256i m2)

PUNPCKHDQ: __m64 _mm_unpackhi_pi32(__m64 m1, __m64 m2)
(V)PUNPCKHDQ: __m128i _mm_unpackhi_epi32(__m128i m1, __m128i m2)
VPUNPCKHDQ: __m256i _mm256_unpackhi_epi32(__m256i m1, __m256i m2)

(V)PUNPCKHQDQ: __m128i _mm_unpackhi_epi64(__m128i a, __m128i b)
VPUNPCKHQDQ: __m256i _mm256_unpackhi_epi64(__m256i a, __m256i b)

Flags Affected 

None. 

Numeric Exceptions 

None. 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded VPUNPCKHDQ/QDQ, see Exceptions Type E4NF.

EVEX-encoded VPUNPCKHBW/WD, see Exceptions Type E4NF.nb. 

PUNPCKLBW/PUNPCKLWD/PUNPCKLDQ/PUNPCKLQDQ-Unpack Low Data

| Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description |
|---|---|---|---|---|
| 0F 60 /r¹ PUNPCKLBW mm, mm/m32 | RM | V/V | MMX | Interleave low-order bytes from mm and mm/m32 into mm. |
| 66 0F 60 /r PUNPCKLBW xmm1, xmm2/m128 | RM | V/V | SSE2 | Interleave low-order bytes from xmm1 and xmm2/m128 into xmm1. |
| 0F 61 /r¹ PUNPCKLWD mm, mm/m32 | RM | V/V | MMX | Interleave low-order words from mm and mm/m32 into mm. |
| 66 0F 61 /r PUNPCKLWD xmm1, xmm2/m128 | RM | V/V | SSE2 | Interleave low-order words from xmm1 and xmm2/m128 into xmm1. |
| 0F 62 /r¹ PUNPCKLDQ mm, mm/m32 | RM | V/V | MMX | Interleave low-order doublewords from mm and mm/m32 into mm. |
| 66 0F 62 /r PUNPCKLDQ xmm1, xmm2/m128 | RM | V/V | SSE2 | Interleave low-order doublewords from xmm1 and xmm2/m128 into xmm1. |
| 66 0F 6C /r PUNPCKLQDQ xmm1, xmm2/m128 | RM | V/V | SSE2 | Interleave low-order quadword from xmm1 and xmm2/m128 into xmm1 register. |
| VEX.NDS.128.66.0F.WIG 60 /r VPUNPCKLBW xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Interleave low-order bytes from xmm2 and xmm3/m128 into xmm1. |
| VEX.NDS.128.66.0F.WIG 61 /r VPUNPCKLWD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Interleave low-order words from xmm2 and xmm3/m128 into xmm1. |
| VEX.NDS.128.66.0F.WIG 62 /r VPUNPCKLDQ xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Interleave low-order doublewords from xmm2 and xmm3/m128 into xmm1. |
| VEX.NDS.128.66.0F.WIG 6C /r VPUNPCKLQDQ xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Interleave low-order quadword from xmm2 and xmm3/m128 into xmm1 register. |
| VEX.NDS.256.66.0F.WIG 60 /r VPUNPCKLBW ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Interleave low-order bytes from ymm2 and ymm3/m256 into ymm1 register. |
| VEX.NDS.256.66.0F.WIG 61 /r VPUNPCKLWD ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Interleave low-order words from ymm2 and ymm3/m256 into ymm1 register. |
| VEX.NDS.256.66.0F.WIG 62 /r VPUNPCKLDQ ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Interleave low-order doublewords from ymm2 and ymm3/m256 into ymm1 register. |
| VEX.NDS.256.66.0F.WIG 6C /r VPUNPCKLQDQ ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Interleave low-order quadword from ymm2 and ymm3/m256 into ymm1 register. |
| EVEX.NDS.128.66.0F.WIG 60 /r VPUNPCKLBW xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Interleave low-order bytes from xmm2 and xmm3/m128 into xmm1 register subject to write mask k1. |
| EVEX.NDS.128.66.0F.WIG 61 /r VPUNPCKLWD xmm1 {k1}{z}, xmm2, xmm3/m128 | FVM | V/V | AVX512VL AVX512BW | Interleave low-order words from xmm2 and xmm3/m128 into xmm1 register subject to write mask k1. |
| EVEX.NDS.128.66.0F.W0 62 /r VPUNPCKLDQ xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Interleave low-order doublewords from xmm2 and xmm3/m128/m32bcst into xmm1 register subject to write mask k1. |
| EVEX.NDS.128.66.0F.W1 6C /r VPUNPCKLQDQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Interleave low-order quadword from xmm2 and xmm3/m128/m64bcst into xmm1 register subject to write mask k1. |
| EVEX.NDS.256.66.0F.WIG 60 /r VPUNPCKLBW ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Interleave low-order bytes from ymm2 and ymm3/m256 into ymm1 register subject to write mask k1. |
| EVEX.NDS.256.66.0F.WIG 61 /r VPUNPCKLWD ymm1 {k1}{z}, ymm2, ymm3/m256 | FVM | V/V | AVX512VL AVX512BW | Interleave low-order words from ymm2 and ymm3/m256 into ymm1 register subject to write mask k1. |
| EVEX.NDS.256.66.0F.W0 62 /r VPUNPCKLDQ ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Interleave low-order doublewords from ymm2 and ymm3/m256/m32bcst into ymm1 register subject to write mask k1. |
| EVEX.NDS.256.66.0F.W1 6C /r VPUNPCKLQDQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Interleave low-order quadword from ymm2 and ymm3/m256/m64bcst into ymm1 register subject to write mask k1. |
| EVEX.NDS.512.66.0F.WIG 60 /r VPUNPCKLBW zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Interleave low-order bytes from zmm2 and zmm3/m512 into zmm1 register subject to write mask k1. |
| EVEX.NDS.512.66.0F.WIG 61 /r VPUNPCKLWD zmm1 {k1}{z}, zmm2, zmm3/m512 | FVM | V/V | AVX512BW | Interleave low-order words from zmm2 and zmm3/m512 into zmm1 register subject to write mask k1. |
| EVEX.NDS.512.66.0F.W0 62 /r VPUNPCKLDQ zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | FV | V/V | AVX512F | Interleave low-order doublewords from zmm2 and zmm3/m512/m32bcst into zmm1 register subject to write mask k1. |
| EVEX.NDS.512.66.0F.W1 6C /r VPUNPCKLQDQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | FV | V/V | AVX512F | Interleave low-order quadword from zmm2 and zmm3/m512/m64bcst into zmm1 register subject to write mask k1. |


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers"
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.


Instruction Operand Encoding

| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA |
| RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA |
| FVM | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA |
| FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA |


Description 

Unpacks and interleaves the low-order data elements (bytes, words, doublewords, and quadwords) of the destination
operand (first operand) and source operand (second operand) into the destination operand. (Figure 4-22
shows the unpack operation for bytes in 64-bit operands.) The high-order data elements are ignored.

Figure 4-22. PUNPCKLBW Instruction Operation Using 64-bit Operands

Figure 4-23. 256-bit VPUNPCKLDQ Instruction Operation


When the source data comes from a 128-bit memory operand, an implementation may fetch only the appropriate 
64 bits; however, alignment to a 16-byte boundary and normal segment checking will still be enforced. 

The (V)PUNPCKLBW instruction interleaves the low-order bytes of the source and destination operands, the 
(V)PUNPCKLWD instruction interleaves the low-order words of the source and destination operands, the 
(V)PUNPCKLDQ instruction interleaves the low-order doubleword (or doublewords) of the source and destination 
operands, and the (V)PUNPCKLQDQ instruction interleaves the low-order quadwords of the source and destination 
operands. 

These instructions can be used to convert bytes to words, words to doublewords, doublewords to quadwords, and
quadwords to double quadwords, respectively, by placing all 0s in the source operand. Here, if the source operand
contains all 0s, the result (stored in the destination operand) contains zero extensions of the low-order data
elements from the original value in the destination operand. For example, with the (V)PUNPCKLBW instruction the
low-order bytes are zero extended (that is, unpacked into unsigned word integers), and with the (V)PUNPCKLWD
instruction, the low-order words are zero extended (unpacked into unsigned doubleword integers).
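
The zero-extension idiom described above can be illustrated with a small Python model. This is a sketch under stated assumptions: vectors are lists of byte values with index 0 least significant, only a single 64-bit or 128-bit lane is modeled, and `unpack_low` and `bytes_to_words` are hypothetical helper names rather than intrinsics.

```python
def unpack_low(dest, src):
    """Interleave the low half of dest with the low half of src (single lane)."""
    half = len(dest) // 2
    out = []
    for k in range(half):
        out += [dest[k], src[k]]   # dest element lands in the even slot
    return out


def bytes_to_words(byte_vec):
    """Zero-extend the low-order bytes to words by unpacking against all 0s."""
    pairs = unpack_low(byte_vec, [0] * len(byte_vec))
    # Each (low byte, high byte) pair forms one little-endian 16-bit word.
    return [pairs[i] | (pairs[i + 1] << 8) for i in range(0, len(pairs), 2)]


data = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88]
print([hex(w) for w in bytes_to_words(data)])
```

Only the low four bytes of the eight-byte input survive; the upper four would require the corresponding unpack-high instruction with the same zero operand.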

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15). 

Legacy SSE versions 64-bit operand: The source operand can be an MMX technology register or a 32-bit memory 
location. The destination operand is an MMX technology register. 

128-bit Legacy SSE versions: The second source operand is an XMM register or a 128-bit memory location. The
first source operand and destination operands are XMM registers. Bits (VLMAX-1:128) of the corresponding YMM
destination register remain unchanged.

VEX.128 encoded versions: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (VLMAX-1:128) of the destination YMM register
are zeroed.

VEX.256 encoded version: The second source operand is a YMM register or a 256-bit memory location. The first
source operand and destination operands are YMM registers. Bits (MAX_VL-1:256) of the corresponding ZMM
register are zeroed.

EVEX encoded VPUNPCKLDQ/QDQ: The second source operand is a ZMM/YMM/XMM register, a 512/256/128-bit
memory location or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. The first source
operand and destination operands are ZMM/YMM/XMM registers. The destination is conditionally updated with
writemask k1.

EVEX encoded VPUNPCKLWD/BW: The second source operand is a ZMM/YMM/XMM register or a 512/256/128-bit
memory location. The first source operand and destination operands are ZMM/YMM/XMM registers. The destination
is conditionally updated with writemask k1.

Operation 

PUNPCKLBW instruction with 64-bit operands:

DEST[63:56] <- SRC[31:24];
DEST[55:48] <- DEST[31:24];
DEST[47:40] <- SRC[23:16];
DEST[39:32] <- DEST[23:16];
DEST[31:24] <- SRC[15:8];
DEST[23:16] <- DEST[15:8];
DEST[15:8] <- SRC[7:0];
DEST[7:0] <- DEST[7:0];

PUNPCKLWD instruction with 64-bit operands:

DEST[63:48] <- SRC[31:16];
DEST[47:32] <- DEST[31:16];
DEST[31:16] <- SRC[15:0];
DEST[15:0] <- DEST[15:0];

PUNPCKLDQ instruction with 64-bit operands:

DEST[63:32] <- SRC[31:0];
DEST[31:0] <- DEST[31:0];
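
The three 64-bit forms above differ only in element width. A minimal Python sketch that treats the operands as 64-bit integers and reproduces the bit-range assignments is given below; `bits` and `punpckl_mmx` are illustrative names, not part of any Intel API.

```python
def bits(value, hi, lo):
    """Extract value[hi:lo] using the manual's inclusive bit-range notation."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def punpckl_mmx(dest, src, elem_bits):
    """Model of PUNPCKLBW/WD/DQ on 64-bit operands (elem_bits = 8, 16, or 32).

    Only the low 32 bits of each operand are read; the high halves are ignored.
    """
    result = 0
    for k in range(32 // elem_bits):
        lo = k * elem_bits
        d = bits(dest, lo + elem_bits - 1, lo)
        s = bits(src, lo + elem_bits - 1, lo)
        result |= d << (2 * lo)               # DEST element goes to the even slot
        result |= s << (2 * lo + elem_bits)   # SRC element goes to the odd slot
    return result


dest = 0xFFFFFFFF44332211   # upper 32 bits are ignored by the operation
src = 0x00000000DDCCBBAA
print(hex(punpckl_mmx(dest, src, 8)))
print(hex(punpckl_mmx(dest, src, 16)))
print(hex(punpckl_mmx(dest, src, 32)))
```

Reading all inputs before building the result avoids the ordering hazard implicit in the pseudocode, where DEST appears on both sides of the assignments.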

INTERLEAVE_BYTES_512b(SRC1, SRC2)
TMP_DEST[255:0] <- INTERLEAVE_BYTES_256b(SRC1[255:0], SRC2[255:0])
TMP_DEST[511:256] <- INTERLEAVE_BYTES_256b(SRC1[511:256], SRC2[511:256])

INTERLEAVE_BYTES_256b(SRC1, SRC2)
DEST[7:0] <- SRC1[7:0]
DEST[15:8] <- SRC2[7:0]
DEST[23:16] <- SRC1[15:8]
DEST[31:24] <- SRC2[15:8]
DEST[39:32] <- SRC1[23:16]
DEST[47:40] <- SRC2[23:16]
DEST[55:48] <- SRC1[31:24]
DEST[63:56] <- SRC2[31:24]
DEST[71:64] <- SRC1[39:32]
DEST[79:72] <- SRC2[39:32]
DEST[87:80] <- SRC1[47:40]
DEST[95:88] <- SRC2[47:40]
DEST[103:96] <- SRC1[55:48]
DEST[111:104] <- SRC2[55:48]
DEST[119:112] <- SRC1[63:56]
DEST[127:120] <- SRC2[63:56]
DEST[135:128] <- SRC1[135:128]
DEST[143:136] <- SRC2[135:128]
DEST[151:144] <- SRC1[143:136]
DEST[159:152] <- SRC2[143:136]
DEST[167:160] <- SRC1[151:144]
DEST[175:168] <- SRC2[151:144]
DEST[183:176] <- SRC1[159:152]
DEST[191:184] <- SRC2[159:152]
DEST[199:192] <- SRC1[167:160]
DEST[207:200] <- SRC2[167:160]
DEST[215:208] <- SRC1[175:168]
DEST[223:216] <- SRC2[175:168]
DEST[231:224] <- SRC1[183:176]
DEST[239:232] <- SRC2[183:176]
DEST[247:240] <- SRC1[191:184]
DEST[255:248] <- SRC2[191:184]

INTERLEAVE_BYTES(SRC1, SRC2)
DEST[7:0] <- SRC1[7:0]
DEST[15:8] <- SRC2[7:0]
DEST[23:16] <- SRC1[15:8]
DEST[31:24] <- SRC2[15:8]
DEST[39:32] <- SRC1[23:16]
DEST[47:40] <- SRC2[23:16]
DEST[55:48] <- SRC1[31:24]
DEST[63:56] <- SRC2[31:24]
DEST[71:64] <- SRC1[39:32]
DEST[79:72] <- SRC2[39:32]
DEST[87:80] <- SRC1[47:40]
DEST[95:88] <- SRC2[47:40]
DEST[103:96] <- SRC1[55:48]
DEST[111:104] <- SRC2[55:48]
DEST[119:112] <- SRC1[63:56]
DEST[127:120] <- SRC2[63:56]

INTERLEAVE_WORDS_512b(SRC1, SRC2)
TMP_DEST[255:0] <- INTERLEAVE_WORDS_256b(SRC1[255:0], SRC2[255:0])
TMP_DEST[511:256] <- INTERLEAVE_WORDS_256b(SRC1[511:256], SRC2[511:256])

INTERLEAVE_WORDS_256b(SRC1, SRC2)
DEST[15:0] <- SRC1[15:0]
DEST[31:16] <- SRC2[15:0]
DEST[47:32] <- SRC1[31:16]
DEST[63:48] <- SRC2[31:16]
DEST[79:64] <- SRC1[47:32]
DEST[95:80] <- SRC2[47:32]
DEST[111:96] <- SRC1[63:48]
DEST[127:112] <- SRC2[63:48]
DEST[143:128] <- SRC1[143:128]
DEST[159:144] <- SRC2[143:128]
DEST[175:160] <- SRC1[159:144]
DEST[191:176] <- SRC2[159:144]
DEST[207:192] <- SRC1[175:160]
DEST[223:208] <- SRC2[175:160]
DEST[239:224] <- SRC1[191:176]
DEST[255:240] <- SRC2[191:176]

INTERLEAVE_WORDS(SRC1, SRC2)
DEST[15:0] <- SRC1[15:0]
DEST[31:16] <- SRC2[15:0]
DEST[47:32] <- SRC1[31:16]
DEST[63:48] <- SRC2[31:16]
DEST[79:64] <- SRC1[47:32]
DEST[95:80] <- SRC2[47:32]
DEST[111:96] <- SRC1[63:48]
DEST[127:112] <- SRC2[63:48]

INTERLEAVE_DWORDS_512b(SRC1, SRC2)
TMP_DEST[255:0] <- INTERLEAVE_DWORDS_256b(SRC1[255:0], SRC2[255:0])
TMP_DEST[511:256] <- INTERLEAVE_DWORDS_256b(SRC1[511:256], SRC2[511:256])

INTERLEAVE_DWORDS_256b(SRC1, SRC2)
DEST[31:0] <- SRC1[31:0]
DEST[63:32] <- SRC2[31:0]
DEST[95:64] <- SRC1[63:32]
DEST[127:96] <- SRC2[63:32]
DEST[159:128] <- SRC1[159:128]
DEST[191:160] <- SRC2[159:128]
DEST[223:192] <- SRC1[191:160]
DEST[255:224] <- SRC2[191:160]

INTERLEAVE_DWORDS(SRC1, SRC2)
DEST[31:0] <- SRC1[31:0]
DEST[63:32] <- SRC2[31:0]
DEST[95:64] <- SRC1[63:32]
DEST[127:96] <- SRC2[63:32]

INTERLEAVE_QWORDS_512b(SRC1, SRC2)
TMP_DEST[255:0] <- INTERLEAVE_QWORDS_256b(SRC1[255:0], SRC2[255:0])
TMP_DEST[511:256] <- INTERLEAVE_QWORDS_256b(SRC1[511:256], SRC2[511:256])

INTERLEAVE_QWORDS_256b(SRC1, SRC2)
DEST[63:0] <- SRC1[63:0]
DEST[127:64] <- SRC2[63:0]
DEST[191:128] <- SRC1[191:128]
DEST[255:192] <- SRC2[191:128]

INTERLEAVE_QWORDS(SRC1, SRC2)
DEST[63:0] <- SRC1[63:0]
DEST[127:64] <- SRC2[63:0]
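
The 256-bit and 512-bit helpers above do not interleave across the full vector width. Each 128-bit lane is handled on its own, which the following Python sketch makes explicit. The function name, the `lane_elems` parameter, and the list-based vector model (index 0 least significant) are illustrative assumptions.

```python
def interleave_low(src1, src2, lane_elems):
    """Model of the INTERLEAVE_* (low) helpers, one 128-bit lane at a time.

    lane_elems: elements per 128-bit lane (16 bytes, 8 words, 4 dwords, 2 qwords).
    """
    assert len(src1) == len(src2) and len(src1) % lane_elems == 0
    half = lane_elems // 2
    dest = []
    for base in range(0, len(src1), lane_elems):
        for k in range(half):
            dest.append(src1[base + k])   # low-half element of this SRC1 lane
            dest.append(src2[base + k])   # matching element of this SRC2 lane
    return dest


# 256-bit doubleword case (two lanes of four dwords each).
a = list(range(8))
b = [100 + x for x in a]
print(interleave_low(a, b, 4))
```

In the 256-bit doubleword case, elements 2 and 3 of each source never reach the result, while elements 4 and 5 do, which matches the bit ranges in INTERLEAVE_DWORDS_256b.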

PUNPCKLBW

DEST[127:0] <- INTERLEAVE_BYTES(DEST, SRC)
DEST[MAX_VL-1:128] (Unmodified)

VPUNPCKLBW (VEX.128 encoded instruction)

DEST[127:0] <- INTERLEAVE_BYTES(SRC1, SRC2)
DEST[MAX_VL-1:128] <- 0

VPUNPCKLBW (VEX.256 encoded instruction)

DEST[255:0] <- INTERLEAVE_BYTES_256b(SRC1, SRC2)
DEST[MAX_VL-1:256] <- 0

VPUNPCKLBW (EVEX encoded instructions)

(KL, VL) = (16, 128), (32, 256), (64, 512)

IF VL = 128
    TMP_DEST[VL-1:0] <- INTERLEAVE_BYTES(SRC1[VL-1:0], SRC2[VL-1:0])
FI;
IF VL = 256
    TMP_DEST[VL-1:0] <- INTERLEAVE_BYTES_256b(SRC1[VL-1:0], SRC2[VL-1:0])
FI;
IF VL = 512
    TMP_DEST[VL-1:0] <- INTERLEAVE_BYTES_512b(SRC1[VL-1:0], SRC2[VL-1:0])
FI;

FOR j <- 0 TO KL-1
    i <- j * 8
    IF k1[j] OR *no writemask*
        THEN DEST[i+7:i] <- TMP_DEST[i+7:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+7:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+7:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0

PUNPCKLWD

DEST[127:0] <- INTERLEAVE_WORDS(DEST, SRC)
DEST[MAX_VL-1:128] (Unmodified)

VPUNPCKLWD (VEX.128 encoded instruction)

DEST[127:0] <- INTERLEAVE_WORDS(SRC1, SRC2)
DEST[MAX_VL-1:128] <- 0

VPUNPCKLWD (VEX.256 encoded instruction)

DEST[255:0] <- INTERLEAVE_WORDS_256b(SRC1, SRC2)
DEST[MAX_VL-1:256] <- 0

VPUNPCKLWD (EVEX encoded instructions)

(KL, VL) = (8, 128), (16, 256), (32, 512)

IF VL = 128
    TMP_DEST[VL-1:0] <- INTERLEAVE_WORDS(SRC1[VL-1:0], SRC2[VL-1:0])
FI;
IF VL = 256
    TMP_DEST[VL-1:0] <- INTERLEAVE_WORDS_256b(SRC1[VL-1:0], SRC2[VL-1:0])
FI;
IF VL = 512
    TMP_DEST[VL-1:0] <- INTERLEAVE_WORDS_512b(SRC1[VL-1:0], SRC2[VL-1:0])
FI;

FOR j <- 0 TO KL-1
    i <- j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] <- TMP_DEST[i+15:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+15:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0

PUNPCKLDQ

DEST[127:0] <- INTERLEAVE_DWORDS(DEST, SRC)
DEST[MAX_VL-1:128] (Unmodified)

VPUNPCKLDQ (VEX.128 encoded instruction)

DEST[127:0] <- INTERLEAVE_DWORDS(SRC1, SRC2)
DEST[MAX_VL-1:128] <- 0

VPUNPCKLDQ (VEX.256 encoded instruction)

DEST[255:0] <- INTERLEAVE_DWORDS_256b(SRC1, SRC2)
DEST[MAX_VL-1:256] <- 0

VPUNPCKLDQ (EVEX encoded instructions)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j <- 0 TO KL-1
    i <- j * 32
    IF (EVEX.b = 1) AND (SRC2 *is memory*)
        THEN TMP_SRC2[i+31:i] <- SRC2[31:0]
        ELSE TMP_SRC2[i+31:i] <- SRC2[i+31:i]
    FI;
ENDFOR;

IF VL = 128
    TMP_DEST[VL-1:0] <- INTERLEAVE_DWORDS(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;
IF VL = 256
    TMP_DEST[VL-1:0] <- INTERLEAVE_DWORDS_256b(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;
IF VL = 512
    TMP_DEST[VL-1:0] <- INTERLEAVE_DWORDS_512b(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;

FOR j <- 0 TO KL-1
    i <- j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] <- TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+31:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0

PUNPCKLQDQ

DEST[127:0] <- INTERLEAVE_QWORDS(DEST, SRC)
DEST[MAX_VL-1:128] (Unmodified)

VPUNPCKLQDQ (VEX.128 encoded instruction)

DEST[127:0] <- INTERLEAVE_QWORDS(SRC1, SRC2)
DEST[MAX_VL-1:128] <- 0

VPUNPCKLQDQ (VEX.256 encoded instruction)

DEST[255:0] <- INTERLEAVE_QWORDS_256b(SRC1, SRC2)
DEST[MAX_VL-1:256] <- 0

VPUNPCKLQDQ (EVEX encoded instructions)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j <- 0 TO KL-1
    i <- j * 64
    IF (EVEX.b = 1) AND (SRC2 *is memory*)
        THEN TMP_SRC2[i+63:i] <- SRC2[63:0]
        ELSE TMP_SRC2[i+63:i] <- SRC2[i+63:i]
    FI;
ENDFOR;

IF VL = 128
    TMP_DEST[VL-1:0] <- INTERLEAVE_QWORDS(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;
IF VL = 256
    TMP_DEST[VL-1:0] <- INTERLEAVE_QWORDS_256b(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;
IF VL = 512
    TMP_DEST[VL-1:0] <- INTERLEAVE_QWORDS_512b(SRC1[VL-1:0], TMP_SRC2[VL-1:0])
FI;

FOR j <- 0 TO KL-1
    i <- j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] <- TMP_DEST[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+63:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0

Intel C/C++ Compiler Intrinsic Equivalents

VPUNPCKLBW __m512i _mm512_unpacklo_epi8(__m512i a, __m512i b);
VPUNPCKLBW __m512i _mm512_mask_unpacklo_epi8(__m512i s, __mmask64 k, __m512i a, __m512i b);
VPUNPCKLBW __m512i _mm512_maskz_unpacklo_epi8(__mmask64 k, __m512i a, __m512i b);
VPUNPCKLBW __m256i _mm256_mask_unpacklo_epi8(__m256i s, __mmask32 k, __m256i a, __m256i b);
VPUNPCKLBW __m256i _mm256_maskz_unpacklo_epi8(__mmask32 k, __m256i a, __m256i b);
VPUNPCKLBW __m128i _mm_mask_unpacklo_epi8(__m128i s, __mmask16 k, __m128i a, __m128i b);
VPUNPCKLBW __m128i _mm_maskz_unpacklo_epi8(__mmask16 k, __m128i a, __m128i b);

VPUNPCKLWD __m512i _mm512_unpacklo_epi16(__m512i a, __m512i b);
VPUNPCKLWD __m512i _mm512_mask_unpacklo_epi16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPUNPCKLWD __m512i _mm512_maskz_unpacklo_epi16(__mmask32 k, __m512i a, __m512i b);
VPUNPCKLWD __m256i _mm256_mask_unpacklo_epi16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPUNPCKLWD __m256i _mm256_maskz_unpacklo_epi16(__mmask16 k, __m256i a, __m256i b);
VPUNPCKLWD __m128i _mm_mask_unpacklo_epi16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPUNPCKLWD __m128i _mm_maskz_unpacklo_epi16(__mmask8 k, __m128i a, __m128i b);

VPUNPCKLDQ __m512i _mm512_unpacklo_epi32(__m512i a, __m512i b);
VPUNPCKLDQ __m512i _mm512_mask_unpacklo_epi32(__m512i s, __mmask16 k, __m512i a, __m512i b);
VPUNPCKLDQ __m512i _mm512_maskz_unpacklo_epi32(__mmask16 k, __m512i a, __m512i b);
VPUNPCKLDQ __m256i _mm256_mask_unpacklo_epi32(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPUNPCKLDQ __m256i _mm256_maskz_unpacklo_epi32(__mmask8 k, __m256i a, __m256i b);
VPUNPCKLDQ __m128i _mm_mask_unpacklo_epi32(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPUNPCKLDQ __m128i _mm_maskz_unpacklo_epi32(__mmask8 k, __m128i a, __m128i b);

VPUNPCKLQDQ __m512i _mm512_unpacklo_epi64(__m512i a, __m512i b);
VPUNPCKLQDQ __m512i _mm512_mask_unpacklo_epi64(__m512i s, __mmask8 k, __m512i a, __m512i b);
VPUNPCKLQDQ __m512i _mm512_maskz_unpacklo_epi64(__mmask8 k, __m512i a, __m512i b);
VPUNPCKLQDQ __m256i _mm256_mask_unpacklo_epi64(__m256i s, __mmask8 k, __m256i a, __m256i b);
VPUNPCKLQDQ __m256i _mm256_maskz_unpacklo_epi64(__mmask8 k, __m256i a, __m256i b);
VPUNPCKLQDQ __m128i _mm_mask_unpacklo_epi64(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPUNPCKLQDQ __m128i _mm_maskz_unpacklo_epi64(__mmask8 k, __m128i a, __m128i b);

PUNPCKLBW: __m64 _mm_unpacklo_pi8(__m64 m1, __m64 m2)
(V)PUNPCKLBW: __m128i _mm_unpacklo_epi8(__m128i m1, __m128i m2)
VPUNPCKLBW: __m256i _mm256_unpacklo_epi8(__m256i m1, __m256i m2)

PUNPCKLWD: __m64 _mm_unpacklo_pi16(__m64 m1, __m64 m2)
(V)PUNPCKLWD: __m128i _mm_unpacklo_epi16(__m128i m1, __m128i m2)
VPUNPCKLWD: __m256i _mm256_unpacklo_epi16(__m256i m1, __m256i m2)

PUNPCKLDQ: __m64 _mm_unpacklo_pi32(__m64 m1, __m64 m2)
(V)PUNPCKLDQ: __m128i _mm_unpacklo_epi32(__m128i m1, __m128i m2)
VPUNPCKLDQ: __m256i _mm256_unpacklo_epi32(__m256i m1, __m256i m2)

(V)PUNPCKLQDQ: __m128i _mm_unpacklo_epi64(__m128i m1, __m128i m2)
VPUNPCKLQDQ: __m256i _mm256_unpacklo_epi64(__m256i m1, __m256i m2)


Flags Affected 

None. 


Numeric Exceptions 

None. 


Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 
EVEX-encoded VPUNPCKLDQ/QDQ, see Exceptions Type E4NF. 
EVEX-encoded VPUNPCKLBW/WD, see Exceptions Type E4NF.nb. 

PUSH—Push Word, Doubleword or Quadword Onto the Stack

| Opcode* | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description |
|---|---|---|---|---|---|
| FF /6 | PUSH r/m16 | M | Valid | Valid | Push r/m16. |
| FF /6 | PUSH r/m32 | M | N.E. | Valid | Push r/m32. |
| FF /6 | PUSH r/m64 | M | Valid | N.E. | Push r/m64. |
| 50+rw | PUSH r16 | O | Valid | Valid | Push r16. |
| 50+rd | PUSH r32 | O | N.E. | Valid | Push r32. |
| 50+rd | PUSH r64 | O | Valid | N.E. | Push r64. |
| 6A ib | PUSH imm8 | I | Valid | Valid | Push imm8. |
| 68 iw | PUSH imm16 | I | Valid | Valid | Push imm16. |
| 68 id | PUSH imm32 | I | Valid | Valid | Push imm32. |
| 0E | PUSH CS | NP | Invalid | Valid | Push CS. |
| 16 | PUSH SS | NP | Invalid | Valid | Push SS. |
| 1E | PUSH DS | NP | Invalid | Valid | Push DS. |
| 06 | PUSH ES | NP | Invalid | Valid | Push ES. |
| 0F A0 | PUSH FS | NP | Valid | Valid | Push FS. |
| 0F A8 | PUSH GS | NP | Valid | Valid | Push GS. |


NOTES: 

* See IA-32 Architecture Compatibility section below. 


Instruction Operand Encoding

| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|
| M | ModRM:r/m (r) | NA | NA | NA |
| O | opcode + rd (r) | NA | NA | NA |
| I | imm8/16/32 | NA | NA | NA |
| NP | NA | NA | NA | NA |


Description 

Decrements the stack pointer and then stores the source operand on the top of the stack. Address and operand 
sizes are determined and used as follows: 

• Address size. The D flag in the current code-segment descriptor determines the default address size; it may be 
overridden by an instruction prefix (67H). 

The address size is used only when referencing a source operand in memory. 

• Operand size. The D flag in the current code-segment descriptor determines the default operand size; it may 
be overridden by instruction prefixes (66H or REX.W). 

The operand size (16, 32, or 64 bits) determines the amount by which the stack pointer is decremented (2, 4 
or 8). 

If the source operand is an immediate of size less than the operand size, a sign-extended value is pushed on
the stack. If the source operand is a segment register (16 bits) and the operand size is 64 bits, a
zero-extended value is pushed on the stack; if the operand size is 32 bits, either a zero-extended value is pushed
on the stack or the segment selector is written on the stack using a 16-bit move. For the last case, all recent
Core and Atom processors perform a 16-bit move, leaving the upper portion of the stack location unmodified.

• Stack-address size. Outside of 64-bit mode, the B flag in the current stack-segment descriptor determines the 
size of the stack pointer (16 or 32 bits); in 64-bit mode, the size of the stack pointer is always 64 bits. 

The stack-address size determines the width of the stack pointer when writing to the stack in memory and
when decrementing the stack pointer. (As stated above, the amount by which the stack pointer is
decremented is determined by the operand size.)

If the operand size is less than the stack-address size, the PUSH instruction may result in a misaligned stack 
pointer (a stack pointer that is not aligned on a doubleword or quadword boundary). 
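
The source-extension rules in the operand-size item above can be sketched in Python. This is an illustrative model only: the function names are hypothetical, values are plain integers, and the 32-bit segment-register case that performs a 16-bit move is not modeled.

```python
def sign_extend(value, from_bits, to_bits):
    """Sign-extend a from_bits-wide value to to_bits, returned as unsigned."""
    value &= (1 << from_bits) - 1
    if value >> (from_bits - 1):        # top bit set: value is negative
        value -= 1 << from_bits
    return value & ((1 << to_bits) - 1)


def pushed_immediate(imm, imm_bits, operand_size):
    """Value stored for PUSH imm when the immediate is narrower than the operand."""
    return sign_extend(imm, imm_bits, operand_size)


def pushed_selector_zero_extended(selector, operand_size):
    """Zero-extended form of a 16-bit segment selector."""
    return selector & 0xFFFF            # upper operand_size - 16 bits are 0


print(hex(pushed_immediate(0x80, 8, 32)))          # imm8 with 32-bit operand size
print(hex(pushed_immediate(0x80000000, 32, 64)))   # imm32 with 64-bit operand size
print(hex(pushed_selector_zero_extended(0x002B, 64)))
```

The imm32 case is the one most often overlooked: in 64-bit mode there is no 64-bit immediate form, so a 32-bit immediate with its top bit set is pushed with the upper 32 bits all 1s.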

The PUSH ESP instruction pushes the value of the ESP register as it existed before the instruction was executed. If 
a PUSH instruction uses a memory operand in which the ESP register is used for computing the operand address, 
the address of the operand is computed before the ESP register is decremented. 

If the ESP or SP register is 1 when the PUSH instruction is executed in real-address mode, a stack-fault exception
(#SS) is generated (because the limit of the stack segment is violated). Its delivery encounters a second
stack-fault exception (for the same reason), causing generation of a double-fault exception (#DF). Delivery of the
double-fault exception encounters a third stack-fault exception, and the logical processor enters shutdown mode.
See the discussion of the double-fault exception in Chapter 6 of the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 3A. 

IA-32 Architecture Compatibility 

For IA-32 processors from the Intel 286 on, the PUSH ESP instruction pushes the value of the ESP register as it 
existed before the instruction was executed. (This is also true for Intel 64 architecture, real-address and virtual- 
8086 modes of IA-32 architecture.) For the Intel® 8086 processor, the PUSH SP instruction pushes the new value 
of the SP register (that is the value after it has been decremented by 2). 
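
The difference between the two behaviors can be sketched in C as follows. The `Stack16` type and the function names are hypothetical and exist only for this illustration; the model covers a 16-bit stack with word pushes and ignores segment limits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64-KByte stack segment used only for this illustration. */
typedef struct {
    uint8_t  mem[65536];
    uint16_t sp;
} Stack16;

/* Store a word little-endian at the current SP (offsets wrap at 64 KBytes). */
static void store16(Stack16 *s, uint16_t value)
{
    s->mem[s->sp] = (uint8_t)(value & 0xFF);
    s->mem[(uint16_t)(s->sp + 1)] = (uint8_t)(value >> 8);
}

/* PUSH SP on the Intel 286 and later: the value before the decrement is stored. */
void push_sp_286(Stack16 *s)
{
    assert(s);
    uint16_t old_sp = s->sp;
    s->sp = (uint16_t)(s->sp - 2);
    store16(s, old_sp);
}

/* PUSH SP on the 8086: the already-decremented value is stored. */
void push_sp_8086(Stack16 *s)
{
    assert(s);
    s->sp = (uint16_t)(s->sp - 2);
    store16(s, s->sp);
}

/* Read back the word at the top of the stack. */
uint16_t top16(const Stack16 *s)
{
    return (uint16_t)(s->mem[s->sp] | (s->mem[(uint16_t)(s->sp + 1)] << 8));
}
```

Starting from SP = 0100H, the first model stores 0100H and the second stores 00FEH; in both cases SP ends at 00FEH.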

Operation 

(* See Description section for possible sign-extension or zero-extension of source operand and for
   a case in which the size of the memory store may be smaller than the instruction's operand size *)

IF StackAddrSize = 64
    THEN
        IF OperandSize = 64
            THEN
                RSP ← RSP - 8;
                Memory[SS:RSP] ← SRC; (* push quadword *)
        ELSE IF OperandSize = 32
            THEN
                RSP ← RSP - 4;
                Memory[SS:RSP] ← SRC; (* push dword *)
            ELSE (* OperandSize = 16 *)
                RSP ← RSP - 2;
                Memory[SS:RSP] ← SRC; (* push word *)
        FI;
ELSE IF StackAddrSize = 32
    THEN
        IF OperandSize = 64
            THEN
                ESP ← ESP - 8;
                Memory[SS:ESP] ← SRC; (* push quadword *)
        ELSE IF OperandSize = 32
            THEN
                ESP ← ESP - 4;
                Memory[SS:ESP] ← SRC; (* push dword *)
            ELSE (* OperandSize = 16 *)
                ESP ← ESP - 2;
                Memory[SS:ESP] ← SRC; (* push word *)
        FI;
    ELSE (* StackAddrSize = 16 *)
        IF OperandSize = 32
            THEN
                SP ← SP - 4;
                Memory[SS:SP] ← SRC; (* push dword *)
            ELSE (* OperandSize = 16 *)
                SP ← SP - 2;
                Memory[SS:SP] ← SRC; (* push word *)
        FI;
FI;

Flags Affected 

None. 

Protected Mode Exceptions 

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

If the DS, ES, FS, or GS register is used to access memory and it contains a NULL segment 
selector. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the 

current privilege level is 3. 

#UD If the LOCK prefix is used. 

Real-Address Mode Exceptions 

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#SS If a memory operand effective address is outside the SS segment limit. 

If the new value of the SP or ESP register is outside the stack segment limit. 

#UD If the LOCK prefix is used. 

Virtual-8086 Mode Exceptions 

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made. 

#UD If the LOCK prefix is used. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions 

#GP(0) If the memory address is in a non-canonical form. 

#SS(0) If the stack address is in a non-canonical form. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the 

current privilege level is 3. 

#UD If the LOCK prefix is used. 

If the PUSH is of CS, SS, DS, or ES. 


PUSHA/PUSHAD-Push All General-Purpose Registers

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
60 | PUSHA | NP | Invalid | Valid | Push AX, CX, DX, BX, original SP, BP, SI, and DI.
60 | PUSHAD | NP | Invalid | Valid | Push EAX, ECX, EDX, EBX, original ESP, EBP, ESI, and EDI.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
NP | NA | NA | NA | NA


Description 

Pushes the contents of the general-purpose registers onto the stack. The registers are stored on the stack in the 
following order: EAX, ECX, EDX, EBX, ESP (original value), EBP, ESI, and EDI (if the current operand-size attribute 
is 32) and AX, CX, DX, BX, SP (original value), BP, SI, and DI (if the operand-size attribute is 16). These 
instructions perform the reverse operation of the POPA/POPAD instructions. The value pushed for the ESP or SP 
register is its value prior to pushing the first register (see the "Operation" section below). 

The PUSHA (push all) and PUSHAD (push all double) mnemonics reference the same opcode. The PUSHA 
instruction is intended for use when the operand-size attribute is 16 and the PUSHAD instruction for when the 
operand-size attribute is 32. Some assemblers may force the operand size to 16 when PUSHA is used and to 32 when 
PUSHAD is used. Others may treat these mnemonics as synonyms (PUSHA/PUSHAD) and use the current setting 
of the operand-size attribute to determine the size of values to be pushed onto the stack, regardless of the 
mnemonic used. 

In the real-address mode, if the ESP or SP register is 1, 3, or 5 when PUSHA/PUSHAD executes: an #SS exception 
is generated but not delivered (the stack error reported prevents #SS delivery). Next, the processor generates a 
#DF exception and enters a shutdown state as described in the #DF discussion in Chapter 6 of the Intel® 64 and 
IA-32 Architectures Software Developer's Manual, Volume 3A. 

This instruction executes as described in compatibility mode and legacy mode. It is not valid in 64-bit mode. 

Operation 

IF 64-bit Mode
    THEN #UD
FI;

IF OperandSize = 32 (* PUSHAD instruction *)
    THEN
        Temp ← (ESP);
        Push(EAX);
        Push(ECX);
        Push(EDX);
        Push(EBX);
        Push(Temp);
        Push(EBP);
        Push(ESI);
        Push(EDI);
    ELSE (* OperandSize = 16, PUSHA instruction *)
        Temp ← (SP);
        Push(AX);
        Push(CX);
        Push(DX);
        Push(BX);
        Push(Temp);
        Push(BP);
        Push(SI);
        Push(DI);
FI;
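
The push order and the use of the saved stack-pointer value can be sketched in C as follows. The `Regs32` type and `pushad_model` function are hypothetical; to keep the sketch short, the stack is an array of doublewords and `esp` is treated as a doubleword index rather than a byte address.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
} Regs32;

/* Hypothetical model of PUSHAD on a flat stack of doublewords.
 * Assumption: regs->esp is an index into stack[], not a byte address. */
void pushad_model(Regs32 *regs, uint32_t *stack)
{
    assert(regs->esp >= 8);          /* room for eight doublewords */
    uint32_t temp = regs->esp;       /* Temp <- (ESP), taken before any push */
    const uint32_t values[8] = {
        regs->eax, regs->ecx, regs->edx, regs->ebx,
        temp, regs->ebp, regs->esi, regs->edi
    };
    for (int i = 0; i < 8; i++) {
        regs->esp -= 1;              /* Push(): decrement, then store */
        stack[regs->esp] = values[i];
    }
}
```

After the call, the slot written fifth holds the stack-pointer value from before the first push, not the value at the time that slot was written.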

Flags Affected 

None. 

Protected Mode Exceptions 

#SS(0) If the starting or ending stack address is outside the stack segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If an unaligned memory reference is made while the current privilege level is 3 and alignment 

checking is enabled. 

#UD If the LOCK prefix is used. 

Real-Address Mode Exceptions 

#GP If the ESP or SP register contains 7, 9, 11, 13, or 15. 

#UD If the LOCK prefix is used. 

Virtual-8086 Mode Exceptions 

#GP(0) If the ESP or SP register contains 7, 9, 11, 13, or 15. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If an unaligned memory reference is made while alignment checking is enabled. 

#UD If the LOCK prefix is used. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions 

#UD If in 64-bit mode. 
PUSHF/PUSHFD-Push EFLAGS Register onto the Stack

Opcode* | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
9C | PUSHF | NP | Valid | Valid | Push lower 16 bits of EFLAGS.
9C | PUSHFD | NP | N.E. | Valid | Push EFLAGS.
9C | PUSHFQ | NP | Valid | N.E. | Push RFLAGS.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
NP | NA | NA | NA | NA


Description 

Decrements the stack pointer by 4 (if the current operand-size attribute is 32) and pushes the entire contents of 
the EFLAGS register onto the stack, or decrements the stack pointer by 2 (if the operand-size attribute is 16) and 
pushes the lower 16 bits of the EFLAGS register (that is, the FLAGS register) onto the stack. These instructions 
reverse the operation of the POPF/POPFD instructions. 

When copying the entire EFLAGS register to the stack, the VM and RF flags (bits 16 and 17) are not copied; instead, 
the values for these flags are cleared in the EFLAGS image stored on the stack. See Chapter 3 of the Intel® 64 and 
IA-32 Architectures Software Developer's Manual, Volume 1, for more information about the EFLAGS register. 

The PUSHF (push flags) and PUSHFD (push flags double) mnemonics reference the same opcode. The PUSHF 
instruction is intended for use when the operand-size attribute is 16 and the PUSHFD instruction for when the 
operand-size attribute is 32. Some assemblers may force the operand size to 16 when PUSHF is used and to 32 
when PUSHFD is used. Others may treat these mnemonics as synonyms (PUSHF/PUSHFD) and use the current 
setting of the operand-size attribute to determine the size of values to be pushed onto the stack, regardless of the 
mnemonic used. 

In 64-bit mode, the instruction's default operation is to decrement the stack pointer (RSP) by 8 and push RFLAGS 
on the stack. 16-bit operation is supported using the operand size override prefix 66H. 32-bit operand size cannot 
be encoded in this mode. When copying RFLAGS to the stack, the VM and RF flags (bits 16 and 17) are not copied; 
instead, values for these flags are cleared in the RFLAGS image stored on the stack. 

When in virtual-8086 mode and the I/O privilege level (IOPL) is less than 3, the PUSHF/PUSHFD instruction causes 
a general protection exception (#GP). 

In the real-address mode, if the ESP or SP register is 1 when the PUSHF/PUSHFD instruction executes: an #SS 
exception is generated but not delivered (the stack error reported prevents #SS delivery). Next, the processor generates 
a #DF exception and enters a shutdown state as described in the #DF discussion in Chapter 6 of the Intel® 64 and 
IA-32 Architectures Software Developer's Manual, Volume 3A. 

Operation 

IF (PE = 0) or (PE = 1 and ((VM = 0) or (VM = 1 and IOPL = 3)))
(* Real-Address Mode, Protected mode, or Virtual-8086 mode with IOPL equal to 3 *)
    THEN
        IF OperandSize = 32
            THEN
                push (EFLAGS AND 00FCFFFFH);
                (* VM and RF EFLAG bits are cleared in image stored on the stack *)
            ELSE
                push (EFLAGS); (* Lower 16 bits only *)
        FI;
    ELSE IF 64-bit MODE (* In 64-bit Mode *)
        IF OperandSize = 64
            THEN
                push (RFLAGS AND 00000000_00FCFFFFH);
                (* VM and RF RFLAG bits are cleared in image stored on the stack; *)
            ELSE
                push (EFLAGS); (* Lower 16 bits only *)
        FI;
    ELSE (* In Virtual-8086 Mode with IOPL less than 3 *)
        #GP(0); (* Trap to virtual-8086 monitor *)
FI;
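
The masking step in the pseudocode above can be expressed in C as a short sketch. The function and macro names are hypothetical; the constant 00FCFFFFH is taken from the Operation section, and RF and VM are assumed to sit at bit positions 16 and 17 respectively.

```c
#include <stdint.h>

#define EFLAGS_RF (1u << 16)  /* resume flag */
#define EFLAGS_VM (1u << 17)  /* virtual-8086 mode flag */

/* Image stored by a 32-bit PUSHFD: EFLAGS AND 00FCFFFFH. */
uint32_t pushfd_image(uint32_t eflags)
{
    return eflags & 0x00FCFFFFu;
}

/* Image stored by a 16-bit PUSHF: the lower 16 bits only. */
uint16_t pushf_image(uint32_t eflags)
{
    return (uint16_t)(eflags & 0xFFFFu);
}
```

For example, an EFLAGS value with VM, RF, IF, and the reserved bit 1 set (00030202H) is stored as 00000202H by the 32-bit form.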

Flags Affected 

None. 

Protected Mode Exceptions 

#SS(0) If the new value of the ESP register is outside the stack segment boundary. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If an unaligned memory reference is made while the current privilege level is 3 and alignment 

checking is enabled. 

#UD If the LOCK prefix is used. 

Real-Address Mode Exceptions 

#UD If the LOCK prefix is used. 

Virtual-8086 Mode Exceptions 

#GP(0) If the I/O privilege level is less than 3. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If an unaligned memory reference is made while alignment checking is enabled. 

#UD If the LOCK prefix is used. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions 

#GP(0) If the memory address is in a non-canonical form. 

#SS(0) If the stack address is in a non-canonical form. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If an unaligned memory reference is made while the current privilege level is 3 and alignment 

checking is enabled. 

#UD If the LOCK prefix is used. 
PXOR-Logical Exclusive OR

Opcode*/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F EF /r (see note 1); PXOR mm, mm/m64 | RM | V/V | MMX | Bitwise XOR of mm/m64 and mm.
66 0F EF /r; PXOR xmm1, xmm2/m128 | RM | V/V | SSE2 | Bitwise XOR of xmm2/m128 and xmm1.
VEX.NDS.128.66.0F.WIG EF /r; VPXOR xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Bitwise XOR of xmm3/m128 and xmm2.
VEX.NDS.256.66.0F.WIG EF /r; VPXOR ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX2 | Bitwise XOR of ymm3/m256 and ymm2.
EVEX.NDS.128.66.0F.W0 EF /r; VPXORD xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL, AVX512F | Bitwise XOR of packed doubleword integers in xmm2 and xmm3/m128 using writemask k1.
EVEX.NDS.256.66.0F.W0 EF /r; VPXORD ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL, AVX512F | Bitwise XOR of packed doubleword integers in ymm2 and ymm3/m256 using writemask k1.
EVEX.NDS.512.66.0F.W0 EF /r; VPXORD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | FV | V/V | AVX512F | Bitwise XOR of packed doubleword integers in zmm2 and zmm3/m512/m32bcst using writemask k1.
EVEX.NDS.128.66.0F.W1 EF /r; VPXORQ xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL, AVX512F | Bitwise XOR of packed quadword integers in xmm2 and xmm3/m128 using writemask k1.
EVEX.NDS.256.66.0F.W1 EF /r; VPXORQ ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL, AVX512F | Bitwise XOR of packed quadword integers in ymm2 and ymm3/m256 using writemask k1.
EVEX.NDS.512.66.0F.W1 EF /r; VPXORQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | FV | V/V | AVX512F | Bitwise XOR of packed quadword integers in zmm2 and zmm3/m512/m64bcst using writemask k1.


NOTES: 

1. See note in Section 2.4, "AVX and SSE Instruction Exception Specification" in the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 2A and Section 22.25.3, "Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers" 
in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. 


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Performs a bitwise logical exclusive-OR (XOR) operation on the source operand (second operand) and the destina¬ 
tion operand (first operand) and stores the result in the destination operand. Each bit of the result is 1 if the corre¬ 
sponding bits of the two operands are different; each bit is 0 if the corresponding bits of the operands are the same. 

In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to 
access additional registers (XMM8-XMM15). 

Legacy SSE instructions 64-bit operand: The source operand can be an MMX technology register or a 64-bit 
memory location. The destination operand is an MMX technology register. 
128-bit Legacy SSE version: The second source operand is an XMM register or a 128-bit memory location. The first 
source operand and destination operands are XMM registers. Bits (VLMAX-1:128) of the corresponding YMM 
destination register remain unchanged. 

VEX.128 encoded version: The second source operand is an XMM register or a 128-bit memory location. The first 
source operand and destination operands are XMM registers. Bits (VLMAX-1:128) of the destination YMM register 
are zeroed. 

VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register 
or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAX_VL-1:256) of the 
corresponding register destination are zeroed. 

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be 
a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 
32/64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with 
writemask k1. 

Operation 

PXOR (64-bit operand)

DEST ← DEST XOR SRC

PXOR (128-bit Legacy SSE version)

DEST ← DEST XOR SRC
DEST[VLMAX-1:128] (Unmodified)

VPXOR (VEX.128 encoded version)

DEST ← SRC1 XOR SRC2
DEST[VLMAX-1:128] ← 0

VPXOR (VEX.256 encoded version)

DEST ← SRC1 XOR SRC2
DEST[VLMAX-1:256] ← 0

VPXORD (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*)
            THEN DEST[i+31:i] ← SRC1[i+31:i] BITWISE XOR SRC2[31:0]
            ELSE DEST[i+31:i] ← SRC1[i+31:i] BITWISE XOR SRC2[i+31:i]
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+31:i] ← 0
        FI;
    FI;
ENDFOR;

DEST[MAX_VL-1:VL] ← 0
VPXORQ (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*)
            THEN DEST[i+63:i] ← SRC1[i+63:i] BITWISE XOR SRC2[63:0]
            ELSE DEST[i+63:i] ← SRC1[i+63:i] BITWISE XOR SRC2[i+63:i]
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE ; zeroing-masking
                DEST[i+63:i] ← 0
        FI;
    FI;
ENDFOR;

DEST[MAX_VL-1:VL] ← 0
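
The per-element masking logic of the EVEX forms can be modeled with scalar C. This is a sketch of the doubleword variant only, not an implementation of the instruction: `vpxord_model` and its parameters are hypothetical names, and the `broadcast` flag stands in for EVEX.b = 1 with a memory source.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of VPXORD for one vector of kl doubleword elements.
 * dest:      destination elements, updated in place
 * k1:        writemask, bit j controls element j
 * zeroing:   nonzero selects zeroing-masking, zero selects merging-masking
 * broadcast: nonzero reuses src2[0] for every element (m32bcst case) */
void vpxord_model(uint32_t *dest, const uint32_t *src1, const uint32_t *src2,
                  unsigned kl, uint16_t k1, int zeroing, int broadcast)
{
    assert(kl == 4 || kl == 8 || kl == 16);
    for (unsigned j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u) {
            dest[j] = src1[j] ^ (broadcast ? src2[0] : src2[j]);
        } else if (zeroing) {
            dest[j] = 0;
        }
        /* merging-masking: dest[j] is left unchanged */
    }
}
```

The quadword variant would follow the same pattern with 64-bit elements and KL values of 2, 4, and 8.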


Intel C/C++ Compiler Intrinsic Equivalent

VPXORD __m512i _mm512_xor_epi32(__m512i a, __m512i b)

VPXORD __m512i _mm512_mask_xor_epi32(__m512i s, __mmask16 m, __m512i a, __m512i b)

VPXORD __m512i _mm512_maskz_xor_epi32(__mmask16 m, __m512i a, __m512i b)

VPXORD __m256i _mm256_xor_epi32(__m256i a, __m256i b)

VPXORD __m256i _mm256_mask_xor_epi32(__m256i s, __mmask8 m, __m256i a, __m256i b)

VPXORD __m256i _mm256_maskz_xor_epi32(__mmask8 m, __m256i a, __m256i b)

VPXORD __m128i _mm_xor_epi32(__m128i a, __m128i b)

VPXORD __m128i _mm_mask_xor_epi32(__m128i s, __mmask8 m, __m128i a, __m128i b)

VPXORD __m128i _mm_maskz_xor_epi32(__mmask8 m, __m128i a, __m128i b)

VPXORQ __m512i _mm512_xor_epi64(__m512i a, __m512i b);

VPXORQ __m512i _mm512_mask_xor_epi64(__m512i s, __mmask8 m, __m512i a, __m512i b);

VPXORQ __m512i _mm512_maskz_xor_epi64(__mmask8 m, __m512i a, __m512i b);

VPXORQ __m256i _mm256_xor_epi64(__m256i a, __m256i b);

VPXORQ __m256i _mm256_mask_xor_epi64(__m256i s, __mmask8 m, __m256i a, __m256i b);

VPXORQ __m256i _mm256_maskz_xor_epi64(__mmask8 m, __m256i a, __m256i b);

VPXORQ __m128i _mm_xor_epi64(__m128i a, __m128i b);

VPXORQ __m128i _mm_mask_xor_epi64(__m128i s, __mmask8 m, __m128i a, __m128i b);

VPXORQ __m128i _mm_maskz_xor_epi64(__mmask8 m, __m128i a, __m128i b);

PXOR: __m64 _mm_xor_si64 (__m64 m1, __m64 m2)

(V)PXOR: __m128i _mm_xor_si128 (__m128i a, __m128i b)

VPXOR: __m256i _mm256_xor_si256 (__m256i a, __m256i b)


Flags Affected 

None. 


Numeric Exceptions 

None. 


Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 
EVEX-encoded instruction, see Exceptions Type E4. 
RCL/RCR/ROL/ROR-Rotate

Opcode** | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
D0 /2 | RCL r/m8, 1 | M1 | Valid | Valid | Rotate 9 bits (CF, r/m8) left once.
REX + D0 /2 | RCL r/m8*, 1 | M1 | Valid | N.E. | Rotate 9 bits (CF, r/m8) left once.
D2 /2 | RCL r/m8, CL | MC | Valid | Valid | Rotate 9 bits (CF, r/m8) left CL times.
REX + D2 /2 | RCL r/m8*, CL | MC | Valid | N.E. | Rotate 9 bits (CF, r/m8) left CL times.
C0 /2 ib | RCL r/m8, imm8 | MI | Valid | Valid | Rotate 9 bits (CF, r/m8) left imm8 times.
REX + C0 /2 ib | RCL r/m8*, imm8 | MI | Valid | N.E. | Rotate 9 bits (CF, r/m8) left imm8 times.
D1 /2 | RCL r/m16, 1 | M1 | Valid | Valid | Rotate 17 bits (CF, r/m16) left once.
D3 /2 | RCL r/m16, CL | MC | Valid | Valid | Rotate 17 bits (CF, r/m16) left CL times.
C1 /2 ib | RCL r/m16, imm8 | MI | Valid | Valid | Rotate 17 bits (CF, r/m16) left imm8 times.
D1 /2 | RCL r/m32, 1 | M1 | Valid | Valid | Rotate 33 bits (CF, r/m32) left once.
REX.W + D1 /2 | RCL r/m64, 1 | M1 | Valid | N.E. | Rotate 65 bits (CF, r/m64) left once. Uses a 6 bit count.
D3 /2 | RCL r/m32, CL | MC | Valid | Valid | Rotate 33 bits (CF, r/m32) left CL times.
REX.W + D3 /2 | RCL r/m64, CL | MC | Valid | N.E. | Rotate 65 bits (CF, r/m64) left CL times. Uses a 6 bit count.
C1 /2 ib | RCL r/m32, imm8 | MI | Valid | Valid | Rotate 33 bits (CF, r/m32) left imm8 times.
REX.W + C1 /2 ib | RCL r/m64, imm8 | MI | Valid | N.E. | Rotate 65 bits (CF, r/m64) left imm8 times. Uses a 6 bit count.
D0 /3 | RCR r/m8, 1 | M1 | Valid | Valid | Rotate 9 bits (CF, r/m8) right once.
REX + D0 /3 | RCR r/m8*, 1 | M1 | Valid | N.E. | Rotate 9 bits (CF, r/m8) right once.
D2 /3 | RCR r/m8, CL | MC | Valid | Valid | Rotate 9 bits (CF, r/m8) right CL times.
REX + D2 /3 | RCR r/m8*, CL | MC | Valid | N.E. | Rotate 9 bits (CF, r/m8) right CL times.
C0 /3 ib | RCR r/m8, imm8 | MI | Valid | Valid | Rotate 9 bits (CF, r/m8) right imm8 times.
REX + C0 /3 ib | RCR r/m8*, imm8 | MI | Valid | N.E. | Rotate 9 bits (CF, r/m8) right imm8 times.
D1 /3 | RCR r/m16, 1 | M1 | Valid | Valid | Rotate 17 bits (CF, r/m16) right once.
D3 /3 | RCR r/m16, CL | MC | Valid | Valid | Rotate 17 bits (CF, r/m16) right CL times.
C1 /3 ib | RCR r/m16, imm8 | MI | Valid | Valid | Rotate 17 bits (CF, r/m16) right imm8 times.
D1 /3 | RCR r/m32, 1 | M1 | Valid | Valid | Rotate 33 bits (CF, r/m32) right once. Uses a 6 bit count.
REX.W + D1 /3 | RCR r/m64, 1 | M1 | Valid | N.E. | Rotate 65 bits (CF, r/m64) right once. Uses a 6 bit count.
D3 /3 | RCR r/m32, CL | MC | Valid | Valid | Rotate 33 bits (CF, r/m32) right CL times.
REX.W + D3 /3 | RCR r/m64, CL | MC | Valid | N.E. | Rotate 65 bits (CF, r/m64) right CL times. Uses a 6 bit count.
C1 /3 ib | RCR r/m32, imm8 | MI | Valid | Valid | Rotate 33 bits (CF, r/m32) right imm8 times.
REX.W + C1 /3 ib | RCR r/m64, imm8 | MI | Valid | N.E. | Rotate 65 bits (CF, r/m64) right imm8 times. Uses a 6 bit count.
D0 /0 | ROL r/m8, 1 | M1 | Valid | Valid | Rotate 8 bits r/m8 left once.
REX + D0 /0 | ROL r/m8*, 1 | M1 | Valid | N.E. | Rotate 8 bits r/m8 left once.
D2 /0 | ROL r/m8, CL | MC | Valid | Valid | Rotate 8 bits r/m8 left CL times.
REX + D2 /0 | ROL r/m8*, CL | MC | Valid | N.E. | Rotate 8 bits r/m8 left CL times.
C0 /0 ib | ROL r/m8, imm8 | MI | Valid | Valid | Rotate 8 bits r/m8 left imm8 times.
REX + C0 /0 ib | ROL r/m8*, imm8 | MI | Valid | N.E. | Rotate 8 bits r/m8 left imm8 times.
D1 /0 | ROL r/m16, 1 | M1 | Valid | Valid | Rotate 16 bits r/m16 left once.
D3 /0 | ROL r/m16, CL | MC | Valid | Valid | Rotate 16 bits r/m16 left CL times.
C1 /0 ib | ROL r/m16, imm8 | MI | Valid | Valid | Rotate 16 bits r/m16 left imm8 times.
D1 /0 | ROL r/m32, 1 | M1 | Valid | Valid | Rotate 32 bits r/m32 left once.
REX.W + D1 /0 | ROL r/m64, 1 | M1 | Valid | N.E. | Rotate 64 bits r/m64 left once. Uses a 6 bit count.
D3 /0 | ROL r/m32, CL | MC | Valid | Valid | Rotate 32 bits r/m32 left CL times.
REX.W + D3 /0 | ROL r/m64, CL | MC | Valid | N.E. | Rotate 64 bits r/m64 left CL times. Uses a 6 bit count.
C1 /0 ib | ROL r/m32, imm8 | MI | Valid | Valid | Rotate 32 bits r/m32 left imm8 times.
REX.W + C1 /0 ib | ROL r/m64, imm8 | MI | Valid | N.E. | Rotate 64 bits r/m64 left imm8 times. Uses a 6 bit count.
D0 /1 | ROR r/m8, 1 | M1 | Valid | Valid | Rotate 8 bits r/m8 right once.
REX + D0 /1 | ROR r/m8*, 1 | M1 | Valid | N.E. | Rotate 8 bits r/m8 right once.
D2 /1 | ROR r/m8, CL | MC | Valid | Valid | Rotate 8 bits r/m8 right CL times.
REX + D2 /1 | ROR r/m8*, CL | MC | Valid | N.E. | Rotate 8 bits r/m8 right CL times.
C0 /1 ib | ROR r/m8, imm8 | MI | Valid | Valid | Rotate 8 bits r/m8 right imm8 times.
REX + C0 /1 ib | ROR r/m8*, imm8 | MI | Valid | N.E. | Rotate 8 bits r/m8 right imm8 times.
D1 /1 | ROR r/m16, 1 | M1 | Valid | Valid | Rotate 16 bits r/m16 right once.
D3 /1 | ROR r/m16, CL | MC | Valid | Valid | Rotate 16 bits r/m16 right CL times.
C1 /1 ib | ROR r/m16, imm8 | MI | Valid | Valid | Rotate 16 bits r/m16 right imm8 times.
D1 /1 | ROR r/m32, 1 | M1 | Valid | Valid | Rotate 32 bits r/m32 right once.
REX.W + D1 /1 | ROR r/m64, 1 | M1 | Valid | N.E. | Rotate 64 bits r/m64 right once. Uses a 6 bit count.
D3 /1 | ROR r/m32, CL | MC | Valid | Valid | Rotate 32 bits r/m32 right CL times.
REX.W + D3 /1 | ROR r/m64, CL | MC | Valid | N.E. | Rotate 64 bits r/m64 right CL times. Uses a 6 bit count.
C1 /1 ib | ROR r/m32, imm8 | MI | Valid | Valid | Rotate 32 bits r/m32 right imm8 times.
REX.W + C1 /1 ib | ROR r/m64, imm8 | MI | Valid | N.E. | Rotate 64 bits r/m64 right imm8 times. Uses a 6 bit count.


NOTES: 

* In 64-bit mode, r/m8 can not be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH. 
** See IA-32 Architecture Compatibility section below. 


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
M1 | ModRM:r/m (w) | 1 | NA | NA
MC | ModRM:r/m (w) | CL | NA | NA
MI | ModRM:r/m (w) | imm8 | NA | NA
Description 

Shifts (rotates) the bits of the first operand (destination operand) the number of bit positions specified in the 
second operand (count operand) and stores the result in the destination operand. The destination operand can be 
a register or a memory location; the count operand is an unsigned integer that can be an immediate or a value in 
the CL register. The count is masked to 5 bits (or 6 bits if in 64-bit mode and REX.W = 1). 

The rotate left (ROL) and rotate through carry left (RCL) instructions shift all the bits toward more-significant bit 
positions, except for the most-significant bit, which is rotated to the least-significant bit location. The rotate right 
(ROR) and rotate through carry right (RCR) instructions shift all the bits toward less significant bit positions, except 
for the least-significant bit, which is rotated to the most-significant bit location. 

The RCL and RCR instructions include the CF flag in the rotation. The RCL instruction shifts the CF flag into the 
least-significant bit and shifts the most-significant bit into the CF flag. The RCR instruction shifts the CF flag into 
the most-significant bit and shifts the least-significant bit into the CF flag. For the ROL and ROR instructions, the 
original value of the CF flag is not a part of the result, but the CF flag receives a copy of the bit that was shifted from 
one end to the other. 
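
For the non-carry rotates, the relationship between the result and the CF flag can be sketched in C for the 8-bit case. The `Rot8` type and the functions `rol8` and `ror8` are hypothetical names used only here; the count is masked to 5 bits and then reduced modulo the operand width, as the Operation section describes.

```c
#include <stdint.h>

typedef struct {
    uint8_t value;   /* rotated result */
    int cf;          /* carry flag after the rotate */
} Rot8;

/* ROL r/m8: CF receives a copy of the bit rotated into the LSB.
 * cf_in is returned unchanged when the masked count is 0. */
Rot8 rol8(uint8_t x, unsigned count, int cf_in)
{
    unsigned masked = count & 0x1Fu;
    unsigned n = masked % 8u;
    Rot8 r;
    r.value = (uint8_t)((x << n) | (x >> ((8u - n) % 8u)));
    r.cf = masked ? (int)(r.value & 1u) : cf_in;
    return r;
}

/* ROR r/m8: CF receives a copy of the bit rotated into the MSB. */
Rot8 ror8(uint8_t x, unsigned count, int cf_in)
{
    unsigned masked = count & 0x1Fu;
    unsigned n = masked % 8u;
    Rot8 r;
    r.value = (uint8_t)((x >> n) | (x << ((8u - n) % 8u)));
    r.cf = masked ? (int)((r.value >> 7) & 1u) : cf_in;
    return r;
}
```

Rotating 81H left by one position gives 03H with CF = 1; rotating 01H right by one position gives 80H with CF = 1.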

The OF flag is defined only for the 1-bit rotates; it is undefined in all other cases (except RCL and RCR instructions 
only: a zero-bit rotate does nothing, that is affects no flags). For left rotates, the OF flag is set to the exclusive OR 
of the CF bit (after the rotate) and the most-significant bit of the result. For right rotates, the OF flag is set to the 
exclusive OR of the two most-significant bits of the result. 
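
A rotate through carry and the OF rule for 1-bit left rotates can be sketched in C for the 8-bit case. The `Rcl8` type and `rcl8` function are hypothetical; the value -1 in the `of` field is a convention of this sketch meaning "undefined or unaffected", not an architectural value.

```c
#include <stdint.h>

typedef struct {
    uint8_t value;   /* rotated result */
    int cf;          /* carry flag after the rotate */
    int of;          /* 0 or 1 for a 1-bit rotate, -1 otherwise (sketch convention) */
} Rcl8;

/* RCL r/m8: a 9-bit rotate of (CF, r/m8); the count is masked to 5 bits
 * and then reduced modulo 9. */
Rcl8 rcl8(uint8_t x, unsigned count, int cf_in)
{
    unsigned masked = count & 0x1Fu;
    unsigned n = masked % 9u;
    Rcl8 r = { x, cf_in ? 1 : 0, -1 };
    for (unsigned i = 0; i < n; i++) {
        int msb = (r.value >> 7) & 1;
        r.value = (uint8_t)((r.value << 1) | (unsigned)r.cf);  /* CF enters the LSB */
        r.cf = msb;                                            /* MSB leaves into CF */
    }
    if (masked == 1)
        r.of = ((r.value >> 7) & 1) ^ r.cf;  /* CF (after rotate) XOR MSB of result */
    return r;
}
```

For example, rotating 80H left once through a clear CF yields 00H with CF = 1 and OF = 1.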

In 64-bit mode, using a REX prefix in the form of REX.R permits access to additional registers (R8-R15). Use of 
REX.W promotes the first operand to 64 bits and causes the count operand to become a 6-bit counter. 

IA-32 Architecture Compatibility 

The 8086 does not mask the rotation count. However, all other IA-32 processors (starting with the Intel 286 
processor) do mask the rotation count to 5 bits, resulting in a maximum count of 31. This masking is done in all 
operating modes (including the virtual-8086 mode) to reduce the maximum execution time of the instructions. 
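
The count masking can be written as a one-line C helper. The name `masked_rotate_count` is hypothetical; the 5-bit and 6-bit masks come from the Description section.

```c
#include <assert.h>

/* Rotation count after masking: 6 bits for 64-bit operands (REX.W = 1),
 * 5 bits for all other operand sizes. */
unsigned masked_rotate_count(unsigned count, unsigned operand_bits)
{
    assert(operand_bits == 8 || operand_bits == 16 ||
           operand_bits == 32 || operand_bits == 64);
    return count & (operand_bits == 64 ? 0x3Fu : 0x1Fu);
}
```

A count of 33 applied to a 32-bit operand therefore behaves like a count of 1, and no count can exceed 31 (or 63 for 64-bit operands).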


Operation 

(* RCL and RCR instructions *)
SIZE ← OperandSize;
CASE (determine count) OF
    SIZE = 8: tempCOUNT ← (COUNT AND 1FH) MOD 9;
    SIZE = 16: tempCOUNT ← (COUNT AND 1FH) MOD 17;
    SIZE = 32: tempCOUNT ← COUNT AND 1FH;
    SIZE = 64: tempCOUNT ← COUNT AND 3FH;
ESAC;

(* RCL instruction operation *)
WHILE (tempCOUNT ≠ 0)
    DO
        tempCF ← MSB(DEST);
        DEST ← (DEST * 2) + CF;
        CF ← tempCF;
        tempCOUNT ← tempCOUNT - 1;
    OD;
ELIHW;
IF (COUNT & COUNTMASK) = 1
    THEN OF ← MSB(DEST) XOR CF;
    ELSE OF is undefined;
FI;

(* RCR instruction operation *)
IF (COUNT & COUNTMASK) = 1
    THEN OF ← MSB(DEST) XOR CF;
    ELSE OF is undefined;
FI;
WHILE (tempCOUNT ≠ 0)
    DO
        tempCF ← LSB(DEST);
        DEST ← (DEST / 2) + (CF * 2^(SIZE - 1));
        CF ← tempCF;
        tempCOUNT ← tempCOUNT - 1;
    OD;

(* ROL and ROR instructions *)
IF OperandSize = 64
    THEN COUNTMASK = 3FH;
    ELSE COUNTMASK = 1FH;
FI;

(* ROL instruction operation *)
tempCOUNT ← (COUNT & COUNTMASK) MOD SIZE
WHILE (tempCOUNT ≠ 0)
    DO
        tempCF ← MSB(DEST);
        DEST ← (DEST * 2) + tempCF;
        tempCOUNT ← tempCOUNT - 1;
    OD;
ELIHW;
IF (COUNT & COUNTMASK) ≠ 0
    THEN CF ← LSB(DEST);
FI;
IF (COUNT & COUNTMASK) = 1
    THEN OF ← MSB(DEST) XOR CF;
    ELSE OF is undefined;
FI;

(* ROR instruction operation *)
tempCOUNT ← (COUNT & COUNTMASK) MOD SIZE
WHILE (tempCOUNT ≠ 0)
    DO
        tempCF ← LSB(DEST);
        DEST ← (DEST / 2) + (tempCF * 2^(SIZE - 1));
        tempCOUNT ← tempCOUNT - 1;
    OD;
ELIHW;
IF (COUNT & COUNTMASK) ≠ 0
    THEN CF ← MSB(DEST);
FI;
IF (COUNT & COUNTMASK) = 1
    THEN OF ← MSB(DEST) XOR MSB - 1(DEST);
    ELSE OF is undefined;
FI;
Flags Affected 

If the masked count is 0, the flags are not affected. If the masked count is 1, then the OF flag is affected, otherwise 
(masked count is greater than 1) the OF flag is undefined. The CF flag is affected when the masked count is 
nonzero. The SF, ZF, AF, and PF flags are always unaffected. 


Protected Mode Exceptions 

#GP(0) If the source operand is located in a non-writable segment. 

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

If the DS, ES, FS, or GS register contains a NULL segment selector. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the 
current privilege level is 3. 

#UD If the LOCK prefix is used. 

Real-Address Mode Exceptions 

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#SS If a memory operand effective address is outside the SS segment limit. 

#UD If the LOCK prefix is used. 

Virtual-8086 Mode Exceptions 

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made. 

#UD If the LOCK prefix is used. 


Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions 

#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 

#GP(0) If the source operand is located in a nonwritable segment. 

If the memory address is in a non-canonical form. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the 

current privilege level is 3. 

#UD If the LOCK prefix is used. 
RCPPS-Compute Reciprocals of Packed Single-Precision Floating-Point Values

Opcode*/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F 53 /r; RCPPS xmm1, xmm2/m128 | RM | V/V | SSE | Computes the approximate reciprocals of the packed single-precision floating-point values in xmm2/m128 and stores the results in xmm1.
VEX.128.0F.WIG 53 /r; VRCPPS xmm1, xmm2/m128 | RM | V/V | AVX | Computes the approximate reciprocals of packed single-precision values in xmm2/mem and stores the results in xmm1.
VEX.256.0F.WIG 53 /r; VRCPPS ymm1, ymm2/m256 | RM | V/V | AVX | Computes the approximate reciprocals of packed single-precision values in ymm2/mem and stores the results in ymm1.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA


Description 

Performs a SIMD computation of the approximate reciprocals of the four packed single-precision floating-point 
values in the source operand (second operand) and stores the packed single-precision floating-point results in the 
destination operand. The source operand can be an XMM register or a 128-bit memory location. The destination 
operand is an XMM register. See Figure 10-5 in the Intel® 64 and IA-32 Architectures Software Developer's 
Manual, Volume 1, for an illustration of a SIMD single-precision floating-point operation. 

The relative error for this approximation is: 

|Relative Error| ≤ 1.5 * 2^-12 
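
As a rough illustration of what this bound means, the C sketch below checks whether a candidate value lies within 1.5 * 2^-12 relative error of the true reciprocal. The function name `within_rcpps_bound` is hypothetical, and the check uses double-precision division as its reference, which is an assumption of this sketch rather than anything the instruction specifies.

```c
#include <assert.h>

/* Returns nonzero if approx is within the documented bound of 1/x:
 * |relative error| <= 1.5 * 2^-12. x must be finite and nonzero. */
int within_rcpps_bound(float x, float approx)
{
    assert(x != 0.0f);
    double exact = 1.0 / (double)x;
    double rel = ((double)approx - exact) / exact;
    if (rel < 0.0)
        rel = -rel;
    return rel <= 1.5 / 4096.0;   /* 2^12 = 4096 */
}
```

The bound is roughly 3.66e-4, so an approximation of 1/2 may deviate from 0.5 by at most about 1.8e-4.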

The RCPPS instruction is not affected by the rounding control bits in the MXCSR register. When a source value is a
0.0, an ∞ of the sign of the source value is returned. A denormal source value is treated as a 0.0 (of the same sign).
Tiny results (see Section 4.9.1.5, "Numeric Underflow Exception (#U)" in Intel® 64 and IA-32 Architectures
Software Developer's Manual, Volume 1) are always flushed to 0.0, with the sign of the operand. (Input values greater
than or equal to |1.11111111110100000000000B*2^125| are guaranteed to not produce tiny results; input values
less than or equal to |1.00000000000110000000001B*2^126| are guaranteed to produce tiny results, which are in
turn flushed to 0.0; and input values in between this range may or may not produce tiny results, depending on the
implementation.) When a source value is an SNaN or QNaN, the SNaN is converted to a QNaN or the source QNaN
is returned.
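
The relative-error bound stated for this approximation can be checked in software. The following is a minimal sketch, assuming a finite, nonzero, normal input; the helper name rcp_within_bound and the macro RCP_MAX_REL_ERROR are hypothetical and are not part of any compiler's intrinsic set.

```c
#include <stdbool.h>

/* Documented bound on the relative error of the approximation: 1.5 * 2^-12. */
#define RCP_MAX_REL_ERROR (1.5 / 4096.0)

/* Hypothetical helper: returns true when `approx` is within the documented
 * bound of the exact reciprocal of x (x assumed finite, nonzero, normal).
 * Relative error = |approx - 1/x| / |1/x| = |approx * x - 1|. */
static bool rcp_within_bound(float x, float approx)
{
    double err = (double)approx * (double)x - 1.0;
    if (err < 0.0) {
        err = -err;
    }
    return err <= RCP_MAX_REL_ERROR;
}
```

A caller would pass each source element together with the corresponding element of the result, for example rcp_within_bound(2.0f, result_element).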

In 64-bit mode, using a REX prefix in the form of REX.R permits this instruction to access additional registers 
(XMM8-XMM15). 

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The
destination is not distinct from the first source XMM register and the upper bits (VLMAX-1:128) of the corresponding
YMM register destination are unmodified.

VEX.128 encoded version: The first source operand is an XMM register or 128-bit memory location. The destination
operand is an XMM register. The upper bits (VLMAX-1:128) of the corresponding YMM register destination are
zeroed.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation

RCPPS (128-bit Legacy SSE version)

DEST[31:0] ← APPROXIMATE(1/SRC[31:0])
DEST[63:32] ← APPROXIMATE(1/SRC[63:32])
DEST[95:64] ← APPROXIMATE(1/SRC[95:64])
DEST[127:96] ← APPROXIMATE(1/SRC[127:96])
DEST[VLMAX-1:128] (Unmodified)


VRCPPS (VEX.128 encoded version)

DEST[31:0] ← APPROXIMATE(1/SRC[31:0])
DEST[63:32] ← APPROXIMATE(1/SRC[63:32])
DEST[95:64] ← APPROXIMATE(1/SRC[95:64])
DEST[127:96] ← APPROXIMATE(1/SRC[127:96])
DEST[VLMAX-1:128] ← 0


VRCPPS (VEX.256 encoded version)

DEST[31:0] ← APPROXIMATE(1/SRC[31:0])
DEST[63:32] ← APPROXIMATE(1/SRC[63:32])
DEST[95:64] ← APPROXIMATE(1/SRC[95:64])
DEST[127:96] ← APPROXIMATE(1/SRC[127:96])
DEST[159:128] ← APPROXIMATE(1/SRC[159:128])
DEST[191:160] ← APPROXIMATE(1/SRC[191:160])
DEST[223:192] ← APPROXIMATE(1/SRC[223:192])
DEST[255:224] ← APPROXIMATE(1/SRC[255:224])

Intel C/C++ Compiler Intrinsic Equivalent 

RCPPS: __m128 _mm_rcp_ps(__m128 a)

RCPPS: __m256 _mm256_rcp_ps (__m256 a);

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

See Exceptions Type 4; additionally 
#UD If VEX.vvvv ≠ 1111B.

RCPSS—Compute Reciprocal of Scalar Single-Precision Floating-Point Values

Opcode*/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F 53 /r RCPSS xmm1, xmm2/m32 | RM | V/V | SSE | Computes the approximate reciprocal of the scalar single-precision floating-point value in xmm2/m32 and stores the result in xmm1.
VEX.NDS.LIG.F3.0F.WIG 53 /r VRCPSS xmm1, xmm2, xmm3/m32 | RVM | V/V | AVX | Computes the approximate reciprocal of the scalar single-precision floating-point value in xmm3/m32 and stores the result in xmm1. Also, upper single precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32].


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Computes an approximate reciprocal of the low single-precision floating-point value in the source operand
(second operand) and stores the single-precision floating-point result in the destination operand. The source
operand can be an XMM register or a 32-bit memory location. The destination operand is an XMM register. The
three high-order doublewords of the destination operand remain unchanged. See Figure 10-6 in the Intel® 64 and
IA-32 Architectures Software Developer's Manual, Volume 1, for an illustration of a scalar single-precision
floating-point operation.

The relative error for this approximation is: 

|Relative Error| ≤ 1.5 * 2^-12

The RCPSS instruction is not affected by the rounding control bits in the MXCSR register. When a source value is a
0.0, an ∞ of the sign of the source value is returned. A denormal source value is treated as a 0.0 (of the same sign).
Tiny results (see Section 4.9.1.5, "Numeric Underflow Exception (#U)" in Intel® 64 and IA-32 Architectures
Software Developer's Manual, Volume 1) are always flushed to 0.0, with the sign of the operand. (Input values greater
than or equal to |1.11111111110100000000000B*2^125| are guaranteed to not produce tiny results; input values
less than or equal to |1.00000000000110000000001B*2^126| are guaranteed to produce tiny results, which are in
turn flushed to 0.0; and input values in between this range may or may not produce tiny results, depending on the
implementation.) When a source value is an SNaN or QNaN, the SNaN is converted to a QNaN or the source QNaN
is returned.

In 64-bit mode, using a REX prefix in the form of REX.R permits this instruction to access additional registers 
(XMM8-XMM15). 

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits
(VLMAX-1:32) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (VLMAX-1:128) of the destination YMM register are zeroed.

Operation 

RCPSS (128-bit Legacy SSE version) 

DEST[31:0] ← APPROXIMATE(1/SRC[31:0])
DEST[VLMAX-1:32] (Unmodified)

VRCPSS (VEX.128 encoded version)

DEST[31:0] ← APPROXIMATE(1/SRC2[31:0])
DEST[127:32] ← SRC1[127:32]
DEST[VLMAX-1:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent 

RCPSS: __m128 _mm_rcp_ss(__m128 a)

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

See Exceptions Type 5.

RDFSBASE/RDGSBASE—Read FS/GS Segment Base

Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
F3 0F AE /0 RDFSBASE r32 | M | V/I | FSGSBASE | Load the 32-bit destination register with the FS base address.
F3 REX.W 0F AE /0 RDFSBASE r64 | M | V/I | FSGSBASE | Load the 64-bit destination register with the FS base address.
F3 0F AE /1 RDGSBASE r32 | M | V/I | FSGSBASE | Load the 32-bit destination register with the GS base address.
F3 REX.W 0F AE /1 RDGSBASE r64 | M | V/I | FSGSBASE | Load the 64-bit destination register with the GS base address.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
M | ModRM:r/m (w) | NA | NA | NA


Description 

Loads the general-purpose register indicated by the modR/M:r/m field with the FS or GS segment base address. 

The destination operand may be either a 32-bit or a 64-bit general-purpose register. The REX.W prefix indicates the 
operand size is 64 bits. If no REX.W prefix is used, the operand size is 32 bits; the upper 32 bits of the source base 
address (for FS or GS) are ignored and upper 32 bits of the destination register are cleared. 

This instruction is supported only in 64-bit mode. 
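
The effect of the operand size on the value delivered to the destination register can be modeled in portable C. This is a sketch of the rule described above, not a call to the hardware; the function name rdbase_result and its parameters are hypothetical.

```c
#include <stdint.h>

/* Models the contents of the 64-bit destination register after
 * RDFSBASE/RDGSBASE, given the segment base address.
 * rex_w != 0: 64-bit operand size, the full base address is loaded.
 * rex_w == 0: 32-bit operand size, the upper 32 bits of the base are
 *             ignored and the upper 32 bits of the destination are cleared. */
static uint64_t rdbase_result(uint64_t segment_base, int rex_w)
{
    if (rex_w) {
        return segment_base;
    }
    return (uint64_t)(uint32_t)segment_base;
}
```

With a base of 00007F1234567890H, the 32-bit form would leave 34567890H in the destination register under this model.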

Operation 

DEST ← FS/GS segment base address;

Flags Affected 

None 

C/C++ Compiler Intrinsic Equivalent 

RDFSBASE: unsigned int _readfsbase_u32(void);

RDFSBASE: unsigned __int64 _readfsbase_u64(void);

RDGSBASE: unsigned int _readgsbase_u32(void);

RDGSBASE: unsigned __int64 _readgsbase_u64(void);

Protected Mode Exceptions 

#UD The RDFSBASE and RDGSBASE instructions are not recognized in protected mode. 

Real-Address Mode Exceptions 

#UD The RDFSBASE and RDGSBASE instructions are not recognized in real-address mode. 

Virtual-8086 Mode Exceptions

#UD The RDFSBASE and RDGSBASE instructions are not recognized in virtual-8086 mode. 

Compatibility Mode Exceptions 

#UD The RDFSBASE and RDGSBASE instructions are not recognized in compatibility mode.

64-Bit Mode Exceptions

#UD If the LOCK prefix is used. 

If CR4.FSGSBASE[bit 16] = 0. 

If CPUID.07H.0H:EBX.FSGSBASE[bit 0] = 0.

RDMSR—Read from Model Specific Register

Opcode* | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
0F 32 | RDMSR | NP | Valid | Valid | Read MSR specified by ECX into EDX:EAX.


NOTES: 

* See IA-32 Architecture Compatibility section below. 


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
NP | NA | NA | NA | NA


Description 

Reads the contents of a 64-bit model specific register (MSR) specified in the ECX register into registers EDX:EAX. 
(On processors that support the Intel 64 architecture, the high-order 32 bits of RCX are ignored.) The EDX register 
is loaded with the high-order 32 bits of the MSR and the EAX register is loaded with the low-order 32 bits. (On 
processors that support the Intel 64 architecture, the high-order 32 bits of each of RAX and RDX are cleared.) If 
fewer than 64 bits are implemented in the MSR being read, the values returned to EDX:EAX in unimplemented bit
locations are undefined. 
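
Software that receives the two register halves usually recombines them into a single 64-bit quantity. The sketch below shows only that recombination; the function name msr_from_edx_eax is hypothetical, and obtaining the register values (which requires privilege level 0) is outside its scope.

```c
#include <stdint.h>

/* Recombines the register pair written by RDMSR into one 64-bit value:
 * EDX holds the high-order 32 bits, EAX the low-order 32 bits. */
static uint64_t msr_from_edx_eax(uint32_t edx, uint32_t eax)
{
    return ((uint64_t)edx << 32) | (uint64_t)eax;
}
```

The same recombination applies to any instruction in this chapter that returns its result in EDX:EAX.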

This instruction must be executed at privilege level 0 or in real-address mode; otherwise, a general protection 
exception #GP(0) will be generated. Specifying a reserved or unimplemented MSR address in ECX will also cause a 
general protection exception. 

The MSRs control functions for testability, execution tracing, performance-monitoring, and machine check errors. 
Chapter 35, "Model-Specific Registers (MSRs)," in the Intel® 64 and IA-32 Architectures Software Developer's 
Manual, Volume 3C, lists all the MSRs that can be read with this instruction and their addresses. Note that each 
processor family has its own set of MSRs. 

The CPUID instruction should be used to determine whether MSRs are supported (CPUID.01H:EDX[5] = 1) before 
using this instruction. 

IA-32 Architecture Compatibility 

The MSRs and the ability to read them with the RDMSR instruction were introduced into the IA-32 Architecture with 
the Pentium processor. Execution of this instruction by an IA-32 processor earlier than the Pentium processor 
results in an invalid opcode exception #UD. 

See "Changes to Instruction Behavior in VMX Non-Root Operation" in Chapter 25 of the Intel® 64 and IA-32 Archi¬ 
tectures Software Developer's Manual, Volume 3C, for more information about the behavior of this instruction in 
VMX non-root operation. 

Operation 

EDX:EAX ← MSR[ECX];

Flags Affected 

None. 

Protected Mode Exceptions 

#GP(0) If the current privilege level is not 0.

If the value in ECX specifies a reserved or unimplemented MSR address.

#UD If the LOCK prefix is used.

Real-Address Mode Exceptions

#GP If the value in ECX specifies a reserved or unimplemented MSR address. 

#UD If the LOCK prefix is used. 

Virtual-8086 Mode Exceptions

#GP(0) The RDMSR instruction is not recognized in virtual-8086 mode. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions 

Same exceptions as in protected mode.

RDPID—Read Processor ID

Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
F3 0F C7 /7 RDPID r32 | M | N.E./V | RDPID | Read IA32_TSC_AUX into r32.
F3 0F C7 /7 RDPID r64 | M | V/N.E. | RDPID | Read IA32_TSC_AUX into r64.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
M | ModRM:r/m (w) | NA | NA | NA


Description 

Reads the value of the IA32_TSC_AUX MSR (address C0000103H) into the destination register. The value of CS.D 
and operand-size prefixes (66H and REX.W) do not affect the behavior of the RDPID instruction. 

Operation 

DEST ← IA32_TSC_AUX

Flags Affected 

None. 

Protected Mode Exceptions 

#UD If the LOCK prefix is used. 

If the F2 prefix is used. 

If CPUID.7H.0:ECX.RDPID[bit 22] = 0. 

Real-Address Mode Exceptions 

Same exceptions as in protected mode. 

Virtual-8086 Mode Exceptions

Same exceptions as in protected mode. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions 

Same exceptions as in protected mode.

RDPKRU—Read Protection Key Rights for User Pages

Opcode* | Instruction | Op/En | 64/32bit Mode Support | CPUID Feature Flag | Description
0F 01 EE | RDPKRU | NP | V/V | OSPKE | Reads PKRU into EAX.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
NP | NA | NA | NA | NA


Description 

Reads the value of PKRU into EAX and clears EDX. ECX must be 0 when RDPKRU is executed; otherwise, a
general-protection exception (#GP) occurs.

RDPKRU can be executed only if CR4.PKE = 1; otherwise, an invalid-opcode exception (#UD) occurs. Software can 
discover the value of CR4.PKE by examining CPUID.(EAX=07H,ECX=0H):ECX.OSPKE [bit 4]. 

On processors that support the Intel 64 Architecture, the high-order 32-bits of RCX are ignored and the high-order 
32-bits of RDX and RAX are cleared. 

Operation 

IF (ECX = 0)
    THEN
        EAX ← PKRU;
        EDX ← 0;
    ELSE #GP(0);
FI;

Flags Affected 

None. 

C/C++ Compiler Intrinsic Equivalent 

RDPKRU: uint32_t _rdpkru_u32(void);

Protected Mode Exceptions 

#GP(0) If ECX ≠ 0.

#UD If the LOCK prefix is used. 

If CR4.PKE = 0. 

Real-Address Mode Exceptions 

Same exceptions as in protected mode. 

Virtual-8086 Mode Exceptions

Same exceptions as in protected mode. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode.

64-Bit Mode Exceptions

Same exceptions as in protected mode.

RDPMC—Read Performance-Monitoring Counters

Opcode* | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
0F 33 | RDPMC | NP | Valid | Valid | Read performance-monitoring counter specified by ECX into EDX:EAX.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
NP | NA | NA | NA | NA


Description 

The EAX register is loaded with the low-order 32 bits. The EDX register is loaded with the supported high-order bits
of the counter. The number of high-order bits loaded into EDX is implementation specific on processors that do not
support architectural performance monitoring. The width of fixed-function and general-purpose performance
counters on processors supporting architectural performance monitoring is reported by CPUID 0AH leaf. See below for
the treatment of the EDX register for "fast" reads.

The ECX register specifies the counter type (if the processor supports architectural performance monitoring) and
counter index. Counter type is specified in ECX[30] to select one of two types of performance counters. If the
processor does not support architectural performance monitoring, ECX[30:0] specifies the counter index;
otherwise ECX[29:0] specifies the index relative to the base of each counter type. ECX[31] selects "fast" read mode if
supported. The two counter types are:

• General-purpose or special-purpose performance counters are specified with ECX[30] = 0. The number of
general-purpose performance counters on processors supporting architectural performance monitoring is
reported by CPUID 0AH leaf. The number of general-purpose counters is model specific if the processor does
not support architectural performance monitoring; see Chapter 18, "Performance Monitoring" of Intel® 64 and
IA-32 Architectures Software Developer's Manual, Volume 3B. Special-purpose counters are available only in
selected processor members; see Table 4-16.

• Fixed-function performance counters are specified with ECX[30] = 1. The number of fixed-function performance
counters is enumerated by CPUID 0AH leaf. See Chapter 30 of Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 3B. This counter type is selected if ECX[30] is set.
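
The ECX layout described above (bit 31 for "fast" read mode, bit 30 for the counter type, bits 29:0 for the index) can be composed as follows. This is a sketch for processors that support architectural performance monitoring; the macro and function names are hypothetical, and whether fast read mode is honored depends on the processor.

```c
#include <stdint.h>

#define RDPMC_FAST_READ  (1u << 31)   /* ECX[31]: "fast" read mode, if supported */
#define RDPMC_TYPE_FIXED (1u << 30)   /* ECX[30]: 1 = fixed-function counter     */
#define RDPMC_INDEX_MASK 0x3FFFFFFFu  /* ECX[29:0]: index within the counter type */

/* Builds the ECX value that selects a counter for RDPMC.
 * fixed != 0 selects a fixed-function counter; otherwise a general-purpose
 * or special-purpose counter is selected. */
static uint32_t rdpmc_make_ecx(int fixed, uint32_t index, int fast)
{
    uint32_t ecx = index & RDPMC_INDEX_MASK;
    if (fixed) {
        ecx |= RDPMC_TYPE_FIXED;
    }
    if (fast) {
        ecx |= RDPMC_FAST_READ;
    }
    return ecx;
}
```

Under this encoding, fixed-function counters 0 through 2 correspond to ECX values 4000_0000H through 4000_0002H.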

The width of fixed-function performance counters and general-purpose performance counters on processors
supporting architectural performance monitoring is reported by CPUID 0AH leaf. The width of general-purpose
performance counters is 40 bits for processors that do not support architectural performance monitoring
counters. The width of special-purpose performance counters is implementation specific.

Table 4-16 lists valid indices of the general-purpose and special-purpose performance counters according to the 
DisplayFamily_DisplayModel values of CPUID encoding for each processor family (see CPUID instruction in Chapter 
3, "Instruction Set Reference, A-L" in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 
2A). 


Table 4-16. Valid General and Special Purpose Performance Counter Index Range for RDPMC

Processor Family | DisplayFamily_DisplayModel/Other Signatures | Valid PMC Index Range | General-purpose Counters
P6 | 06H_01H, 06H_03H, 06H_05H, 06H_06H, 06H_07H, 06H_08H, 06H_0AH, 06H_0BH | 0, 1 | 0, 1
Processors Based on Intel NetBurst microarchitecture (No L3) | 0FH_00H, 0FH_01H, 0FH_02H, 0FH_03H, 0FH_04H, 0FH_06H | ≥ 0 and ≤ 17 | ≥ 0 and ≤ 17
Pentium M processors | 06H_09H, 06H_0DH | 0, 1 | 0, 1
Processors Based on Intel NetBurst microarchitecture (No L3) | (0FH_03H, 0FH_04H) and (L3 is present) | ≥ 0 and ≤ 25 | ≥ 0 and ≤ 17
Intel® Core™ Solo and Intel® Core™ Duo processors, Dual-core Intel® Xeon® processor LV | 06H_0EH | 0, 1 | 0, 1
Intel® Core™2 Duo processor, Intel Xeon processor 3000, 5100, 5300, 7300 Series - general-purpose PMC | 06H_0FH | 0, 1 | 0, 1
Intel® Core™2 Duo processor family, Intel Xeon processor 3100, 3300, 5200, 5400 series - general-purpose PMC | 06H_17H | 0, 1 | 0, 1
Intel Xeon processors 7400 series | (06H_1DH) | ≥ 0 and ≤ 9 | 0, 1
45 nm and 32 nm Intel® Atom™ processors | 06H_1CH, 06_26H, 06_27H, 06_35H, 06_36H | 0, 1 | 0, 1
Intel® Atom™ processors based on Silvermont or Airmont microarchitectures | 06H_37H, 06_4AH, 06_4DH, 06_5AH, 06_5DH, 06_4CH | 0, 1 | 0, 1
Next Generation Intel® Atom™ processors based on Goldmont microarchitecture | 06H_5CH, 06_5FH | 0-3 | 0-3
Intel® processors based on the Nehalem, Westmere microarchitectures | 06H_1AH, 06H_1EH, 06H_1FH, 06_25H, 06_2CH, 06H_2EH, 06_2FH | 0-3 | 0-3
Intel® processors based on the Sandy Bridge, Ivy Bridge microarchitecture | 06H_2AH, 06H_2DH, 06H_3AH, 06H_3EH | 0-3 (0-7 if HyperThreading is off) | 0-3 (0-7 if HyperThreading is off)
Intel® processors based on the Haswell, Broadwell, SkyLake microarchitectures | 06H_3CH, 06H_45H, 06H_46H, 06H_3FH, 06_3DH, 06_47H, 4FH, 06_56H, 06_4EH, 06_5EH | 0-3 (0-7 if HyperThreading is off) | 0-3 (0-7 if HyperThreading is off)


Processors based on Intel NetBurst microarchitecture support "fast" (32-bit) and "slow" (40-bit) reads on the first
18 performance counters. Select this option using ECX[31]. If bit 31 is set, RDPMC reads only the low 32 bits of
the selected performance counter; the 32-bit result is returned in EAX and EDX is set to 0. If bit 31 is clear, all
40 bits are read. A 32-bit read executes faster on these processors than a full 40-bit read.
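
The difference between the two read modes can be illustrated with a small software model. This is a sketch of the behavior described above for the first 18 counters, not an interface to the hardware; the type rdpmc_result and the function netburst_rdpmc are hypothetical.

```c
#include <stdint.h>

/* Register pair produced by a modeled RDPMC read. */
typedef struct {
    uint32_t eax;
    uint32_t edx;
} rdpmc_result;

/* Models a read of a 40-bit counter on a processor based on Intel NetBurst
 * microarchitecture. ECX[31] = 1 requests a "fast" 32-bit read (EDX = 0);
 * ECX[31] = 0 requests a "slow" 40-bit read (EDX = counter bits 39:32). */
static rdpmc_result netburst_rdpmc(uint64_t counter40, uint32_t ecx)
{
    rdpmc_result r;
    r.eax = (uint32_t)counter40;
    if (ecx & (1u << 31)) {
        r.edx = 0;
    } else {
        r.edx = (uint32_t)(counter40 >> 32) & 0xFFu;
    }
    return r;
}
```

For a counter holding AB12345678H, the slow read yields EDX = ABH and the fast read yields EDX = 0; EAX is 12345678H in both cases.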

On processors based on Intel NetBurst microarchitecture with L3, performance counters with indices 18-25 are 32- 
bit counters. EDX is cleared after executing RDPMC for these counters. 

In Intel Core 2 processor family, Intel Xeon processor 3000, 5100, 5300 and 7400 series, the fixed-function
performance counters are 40 bits wide; they can be accessed by RDPMC with ECX between 4000_0000H and
4000_0002H.

On Intel Xeon processor 7400 series, there are eight 32-bit special-purpose counters addressable with indices 2-9, 
ECX[30]=0. 

When in protected or virtual 8086 mode, the performance-monitoring counters enabled (PCE) flag in register CR4 
restricts the use of the RDPMC instruction as follows. When the PCE flag is set, the RDPMC instruction can be 
executed at any privilege level; when the flag is clear, the instruction can only be executed at privilege level 0. 
(When in real-address mode, the RDPMC instruction is always enabled.) 
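
The privilege rule above reduces to a single predicate, which matches the first test in the "Operation" section of this entry. The sketch below is a model only; the function name rdpmc_permitted and its parameters are hypothetical.

```c
#include <stdbool.h>

/* Returns true when RDPMC may execute without a privilege-related #GP(0):
 * CR4.PCE is set, or the current privilege level is 0, or the processor is
 * in real-address mode (CR0.PE = 0). An invalid counter index can still
 * cause #GP; that check is not modeled here. */
static bool rdpmc_permitted(bool cr4_pce, unsigned cpl, bool cr0_pe)
{
    return cr4_pce || cpl == 0 || !cr0_pe;
}
```

For example, rdpmc_permitted(false, 3, true) is false: a privilege level 3 caller in protected mode with CR4.PCE clear would receive #GP(0).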

The performance-monitoring counters can also be read with the RDMSR instruction, when executing at privilege 
level 0. 

The performance-monitoring counters are event counters that can be programmed to count events such as the
number of instructions decoded, number of interrupts received, or number of cache loads. Chapter 19,
"Performance Monitoring Events," in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B, lists
the events that can be counted for various processors in the Intel 64 and IA-32 architecture families.

The RDPMC instruction is not a serializing instruction; that is, it does not imply that all the events caused by the
preceding instructions have been completed or that events caused by subsequent instructions have not begun. If
an exact event count is desired, software must insert a serializing instruction (such as the CPUID instruction)
before and/or after the RDPMC instruction.

Back-to-back fast reads are not guaranteed to be monotonic. To guarantee monotonicity on back-to-back
reads, a serializing instruction must be placed between the two RDPMC instructions.

The RDPMC instruction can execute in 16-bit addressing mode or virtual-8086 mode; however, the full contents of 
the ECX register are used to select the counter, and the event count is stored in the full EAX and EDX registers. The 
RDPMC instruction was introduced into the IA-32 Architecture in the Pentium Pro processor and the Pentium 
processor with MMX technology. The earlier Pentium processors have performance-monitoring counters, but they 
must be read with the RDMSR instruction. 

Operation 

(* Intel processors that support architectural performance monitoring *)

Most significant counter bit (MSCB) = 47

IF ((CR4.PCE = 1) or (CPL = 0) or (CR0.PE = 0))
    THEN IF (ECX[30] = 1 and ECX[29:0] in valid fixed-counter range)
        EAX ← IA32_FIXED_CTR(ECX)[30:0];
        EDX ← IA32_FIXED_CTR(ECX)[MSCB:32];
    ELSE IF (ECX[30] = 0 and ECX[29:0] in valid general-purpose counter range)
        EAX ← PMC(ECX[30:0])[31:0];
        EDX ← PMC(ECX[30:0])[MSCB:32];
    ELSE (* ECX is not valid or CR4.PCE is 0 and CPL is 1, 2, or 3 and CR0.PE is 1 *)
        #GP(0);
FI;

(* Intel Core 2 Duo processor family and Intel Xeon processor 3000, 5100, 5300, 7400 series *)

Most significant counter bit (MSCB) = 39

IF ((CR4.PCE = 1) or (CPL = 0) or (CR0.PE = 0))
    THEN IF (ECX[30] = 1 and ECX[29:0] in valid fixed-counter range)
        EAX ← IA32_FIXED_CTR(ECX)[30:0];
        EDX ← IA32_FIXED_CTR(ECX)[MSCB:32];
    ELSE IF (ECX[30] = 0 and ECX[29:0] in valid general-purpose counter range)
        EAX ← PMC(ECX[30:0])[31:0];
        EDX ← PMC(ECX[30:0])[MSCB:32];
    ELSE IF (ECX[30] = 0 and ECX[29:0] in valid special-purpose counter range)
        EAX ← PMC(ECX[30:0])[31:0]; (* 32-bit read *)
    ELSE (* ECX is not valid or CR4.PCE is 0 and CPL is 1, 2, or 3 and CR0.PE is 1 *)
        #GP(0);
FI;

(* P6 family processors and Pentium processor with MMX technology *)

IF (ECX = 0 or 1) and ((CR4.PCE = 1) or (CPL = 0) or (CR0.PE = 0))
    THEN
        EAX ← PMC(ECX)[31:0];
        EDX ← PMC(ECX)[39:32];
    ELSE (* ECX is not 0 or 1 or CR4.PCE is 0 and CPL is 1, 2, or 3 and CR0.PE is 1 *)
        #GP(0);
FI;

(* Processors based on Intel NetBurst microarchitecture *)

IF ((CR4.PCE = 1) or (CPL = 0) or (CR0.PE = 0))
    THEN IF (ECX[30:0] = 0:17)
        THEN IF ECX[31] = 0
            THEN
                EAX ← PMC(ECX[30:0])[31:0]; (* 40-bit read *)
                EDX ← PMC(ECX[30:0])[39:32];
        ELSE (* ECX[31] = 1 *)
            THEN
                EAX ← PMC(ECX[30:0])[31:0]; (* 32-bit read *)
                EDX ← 0;
        FI;
    ELSE IF (* 64-bit Intel processor based on Intel NetBurst microarchitecture with L3 *)
        THEN IF (ECX[30:0] = 18:25)
            EAX ← PMC(ECX[30:0])[31:0]; (* 32-bit read *)
            EDX ← 0;
        FI;
    ELSE (* Invalid PMC index in ECX[30:0], see Table 4-16. *)
        #GP(0);
    FI;
ELSE (* CR4.PCE = 0 and (CPL = 1, 2, or 3) and CR0.PE = 1 *)
    #GP(0);
FI;

Flags Affected 

None. 

Protected Mode Exceptions 

#GP(0) If the current privilege level is not 0 and the PCE flag in the CR4 register is clear. 

If an invalid performance counter index is specified (see Table 4-16). 

#UD If the LOCK prefix is used. 

Real-Address Mode Exceptions 

#GP If an invalid performance counter index is specified (see Table 4-16). 

#UD If the LOCK prefix is used. 

Virtual-8086 Mode Exceptions

#GP(0) If the PCE flag in the CR4 register is clear. 

If an invalid performance counter index is specified (see Table 4-16). 

#UD If the LOCK prefix is used. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions

#GP(0) If the current privilege level is not 0 and the PCE flag in the CR4 register is clear. 

If an invalid performance counter index is specified (see Table 4-16). 

#UD If the LOCK prefix is used. 




RDRAND—Read Random Number

Opcode*/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F C7 /6 RDRAND r16 | M | V/V | RDRAND | Read a 16-bit random number and store in the destination register.
0F C7 /6 RDRAND r32 | M | V/V | RDRAND | Read a 32-bit random number and store in the destination register.
REX.W + 0F C7 /6 RDRAND r64 | M | V/I | RDRAND | Read a 64-bit random number and store in the destination register.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
M | ModRM:r/m (w) | NA | NA | NA


Description 

Loads a hardware generated random value and stores it in the destination register. The size of the random value is
determined by the destination register size and operating mode. The Carry Flag indicates whether a random value
is available at the time the instruction is executed. CF=1 indicates that the data in the destination is valid.
Otherwise CF=0 and the data in the destination operand will be returned as zeros for the specified width. All other flags
are forced to 0 in either situation. Software must check the state of CF=1 for determining if a valid random value
has been returned, otherwise it is expected to loop and retry execution of RDRAND (see Intel® 64 and IA-32
Architectures Software Developer's Manual, Volume 1, Section 7.3.17, "Random Number Generator Instructions").
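
The check-and-retry pattern described above might be sketched as follows. To keep the example portable, the hardware read is passed in as a function pointer with the same shape as the 32-bit step intrinsic listed for this instruction (it returns 1 when CF=1 and 0 when CF=0). The names rand32_step_fn, rdrand32_retry, and fake_step are hypothetical, the retry limit is an assumption of this sketch, and fake_step is only a stand-in for the hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Shape of a 32-bit step function: returns 1 (CF=1) when *out holds a valid
 * random value, 0 (CF=0) when no value was available. */
typedef int (*rand32_step_fn)(unsigned int *out);

/* Retries `step` up to max_tries times. Returns true and stores the value
 * in *value on success; returns false if every attempt reported CF=0. */
static bool rdrand32_retry(rand32_step_fn step, unsigned max_tries,
                           uint32_t *value)
{
    for (unsigned i = 0; i < max_tries; i++) {
        unsigned int tmp = 0;
        if (step(&tmp)) {
            *value = (uint32_t)tmp;
            return true;
        }
    }
    return false;
}

/* Stand-in for the hardware, for illustration only: reports "not ready"
 * on the first two calls, then returns a fixed value. */
static int fake_step_calls;
static int fake_step(unsigned int *out)
{
    if (++fake_step_calls < 3) {
        *out = 0;
        return 0;
    }
    *out = 0xC0FFEEu;
    return 1;
}
```

On a processor that reports RDRAND support, the step intrinsic for the desired width would be passed in place of fake_step. Bounding the number of retries keeps the caller from spinning indefinitely if no value becomes available.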

This instruction is available at all privilege levels. 

In 64-bit mode, the instruction's default operation size is 32 bits. Using a REX prefix in the form of REX.B permits
access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bit
operands. See the summary chart at the beginning of this section for encoding data and limits.

Operation 

IF HW_RND_GEN.ready = 1
    THEN
        CASE of
            osize is 64: DEST[63:0] ← HW_RND_GEN.data;
            osize is 32: DEST[31:0] ← HW_RND_GEN.data;
            osize is 16: DEST[15:0] ← HW_RND_GEN.data;
        ESAC
        CF ← 1;
    ELSE
        CASE of
            osize is 64: DEST[63:0] ← 0;
            osize is 32: DEST[31:0] ← 0;
            osize is 16: DEST[15:0] ← 0;
        ESAC
        CF ← 0;
FI
OF, SF, ZF, AF, PF ← 0;

Flags Affected 

The CF flag is set according to the result (see the "Operation" section above). The OF, SF, ZF, AF, and PF flags are
set to 0.

Intel C/C++ Compiler Intrinsic Equivalent

RDRAND: int _rdrand16_step( unsigned short *);

RDRAND: int _rdrand32_step( unsigned int *);

RDRAND: int _rdrand64_step( unsigned __int64 *);

Protected Mode Exceptions 

#UD If the LOCK prefix is used. 

If the F2H or F3H prefix is used. 

If CPUID.01H:ECX.RDRAND[bit 30] = 0. 

Real-Address Mode Exceptions 

Same exceptions as in protected mode. 

Virtual-8086 Mode Exceptions

Same exceptions as in protected mode. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions

Same exceptions as in protected mode.

RDSEED—Read Random SEED

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F C7 /7 RDSEED r16 | M | V/V | RDSEED | Read a 16-bit NIST SP800-90B & C compliant random value and store in the destination register.
0F C7 /7 RDSEED r32 | M | V/V | RDSEED | Read a 32-bit NIST SP800-90B & C compliant random value and store in the destination register.
REX.W + 0F C7 /7 RDSEED r64 | M | V/I | RDSEED | Read a 64-bit NIST SP800-90B & C compliant random value and store in the destination register.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
M | ModRM:r/m (w) | NA | NA | NA


Description 

Loads a hardware generated random value and stores it in the destination register. The random value is generated
from an Enhanced NRBG (Non Deterministic Random Bit Generator) that is compliant to NIST SP800-90B and NIST
SP800-90C in the XOR construction mode. The size of the random value is determined by the destination register
size and operating mode. The Carry Flag indicates whether a random value is available at the time the instruction
is executed. CF=1 indicates that the data in the destination is valid. Otherwise CF=0 and the data in the destination
operand will be returned as zeros for the specified width. All other flags are forced to 0 in either situation. Software
must check the state of CF=1 for determining if a valid random seed value has been returned, otherwise it is
expected to loop and retry execution of RDSEED (see Section 1.2).

The RDSEED instruction is available at all privilege levels. The RDSEED instruction executes normally either inside 
or outside a transaction region. 

In 64-bit mode, the instruction's default operation size is 32 bits. Using a REX prefix in the form of REX.B permits
access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bit
operands. See the summary chart at the beginning of this section for encoding data and limits.


Operation

IF HW_NRND_GEN.ready = 1
    THEN
        CASE of
            osize is 64: DEST[63:0] ← HW_NRND_GEN.data;
            osize is 32: DEST[31:0] ← HW_NRND_GEN.data;
            osize is 16: DEST[15:0] ← HW_NRND_GEN.data;
        ESAC;
        CF ← 1;
    ELSE
        CASE of
            osize is 64: DEST[63:0] ← 0;
            osize is 32: DEST[31:0] ← 0;
            osize is 16: DEST[15:0] ← 0;
        ESAC;
        CF ← 0;
FI;
OF, SF, ZF, AF, PF ← 0;

Flags Affected

The CF flag is set according to the result (see the "Operation" section above). The OF, SF, ZF, AF, and PF flags 
are set to 0. 

C/C++ Compiler Intrinsic Equivalent 

RDSEED int _rdseed16_step( unsigned short *); 

RDSEED int _rdseed32_step( unsigned int *); 

RDSEED int _rdseed64_step( unsigned __int64 *);

Protected Mode Exceptions 

#UD If the LOCK prefix is used.

If the F2H or F3H prefix is used.

If CPUID.(EAX=07H, ECX=0H):EBX.RDSEED[bit 18] = 0.

Real-Address Mode Exceptions

#UD If the LOCK prefix is used.

If the F2H or F3H prefix is used.

If CPUID.(EAX=07H, ECX=0H):EBX.RDSEED[bit 18] = 0.

Virtual-8086 Mode Exceptions

#UD If the LOCK prefix is used.

If the F2H or F3H prefix is used.

If CPUID.(EAX=07H, ECX=0H):EBX.RDSEED[bit 18] = 0.

Compatibility Mode Exceptions

#UD If the LOCK prefix is used.

If the F2H or F3H prefix is used.

If CPUID.(EAX=07H, ECX=0H):EBX.RDSEED[bit 18] = 0.

64-Bit Mode Exceptions

#UD If the LOCK prefix is used.

If the F2H or F3H prefix is used.

If CPUID.(EAX=07H, ECX=0H):EBX.RDSEED[bit 18] = 0. 
RDTSC—Read Time-Stamp Counter

Opcode*       Instruction           Op/En  64-Bit Mode  Compat/Leg Mode  Description
0F 31         RDTSC                 NP     Valid        Valid            Read time-stamp counter into EDX:EAX.


Instruction Operand Encoding

Op/En  Operand 1  Operand 2  Operand 3  Operand 4
NP     NA         NA         NA         NA


Description 

Reads the current value of the processor's time-stamp counter (a 64-bit MSR) into the EDX:EAX registers. The EDX 
register is loaded with the high-order 32 bits of the MSR and the EAX register is loaded with the low-order 32 bits. 
(On processors that support the Intel 64 architecture, the high-order 32 bits of each of RAX and RDX are cleared.) 
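
As a brief illustration of the register split described above, the following C sketch reassembles the 64-bit counter from the two 32-bit halves. The function names `tsc_from_edx_eax` and `tsc_elapsed` are hypothetical helpers introduced here, not part of any intrinsic header.

```c
#include <stdint.h>

/* Combine the two 32-bit halves returned by RDTSC into one 64-bit count.
   edx holds the high-order 32 bits, eax the low-order 32 bits. */
static uint64_t tsc_from_edx_eax(uint32_t edx, uint32_t eax)
{
    return ((uint64_t)edx << 32) | eax;
}

/* Elapsed ticks between two readings. Unsigned subtraction stays well
   defined even if the 64-bit counter wrapped once between the readings. */
static uint64_t tsc_elapsed(uint64_t start, uint64_t end)
{
    return end - start;
}
```

Because the upper halves of RAX and RDX are cleared in 64-bit mode, the same shift-and-OR applies whether the values are taken from EAX/EDX or from RAX/RDX.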

The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever 
the processor is reset. See "Time Stamp Counter" in Chapter 17 of the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 3B, for specific details of the time stamp counter behavior. 

The time stamp disable (TSD) flag in register CR4 restricts the use of the RDTSC instruction as follows. When the 
flag is clear, the RDTSC instruction can be executed at any privilege level; when the flag is set, the instruction can 
only be executed at privilege level 0. 
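
The privilege rule above, combined with the condition given in this instruction's Operation section, can be modeled as a single predicate. This is an illustrative sketch only; `rdtsc_allowed` is a hypothetical name, and the three inputs are plain values standing in for CR4.TSD, the CPL, and CR0.PE.

```c
#include <stdbool.h>

/* Returns true if RDTSC may execute, false if it would raise #GP(0).
   cr4_tsd: CR4.TSD flag; cpl: current privilege level (0..3);
   cr0_pe: CR0.PE flag (false means real-address mode). */
static bool rdtsc_allowed(bool cr4_tsd, unsigned cpl, bool cr0_pe)
{
    /* Mirrors: IF (CR4.TSD = 0) or (CPL = 0) or (CR0.PE = 0) */
    return !cr4_tsd || cpl == 0 || !cr0_pe;
}
```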

The time-stamp counter can also be read with the RDMSR instruction, when executing at privilege level 0. 

The RDTSC instruction is not a serializing instruction. It does not necessarily wait until all previous instructions 
have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the 
read operation is performed. If software requires RDTSC to be executed only after all previous instructions have 
completed locally, it can either use RDTSCP (if the processor supports that instruction) or execute the sequence 
LFENCE;RDTSC. 

This instruction was introduced by the Pentium processor. 

See "Changes to Instruction Behavior in VMX Non-Root Operation" in Chapter 25 of the Intel® 64 and IA-32
Architectures Software Developer's Manual, Volume 3C, for more information about the behavior of this instruction in
VMX non-root operation.

Operation 

IF (CR4.TSD = 0) or (CPL = 0) or (CR0.PE = 0)
    THEN EDX:EAX <- TimeStampCounter;
    ELSE (* CR4.TSD = 1 and (CPL = 1, 2, or 3) and CR0.PE = 1 *)
        #GP(0);
FI;

Flags Affected 

None. 

Protected Mode Exceptions 

#GP(0) If the TSD flag in register CR4 is set and the CPL is greater than 0. 

#UD If the LOCK prefix is used. 

Real-Address Mode Exceptions 

#UD If the LOCK prefix is used. 
Virtual-8086 Mode Exceptions

#GP(0) If the TSD flag in register CR4 is set. 

#UD If the LOCK prefix is used. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions

Same exceptions as in protected mode. 
RDTSCP—Read Time-Stamp Counter and Processor ID

Opcode*       Instruction           Op/En  64-Bit Mode  Compat/Leg Mode  Description
0F 01 F9      RDTSCP                NP     Valid        Valid            Read 64-bit time-stamp counter and IA32_TSC_AUX value into EDX:EAX and ECX.


Instruction Operand Encoding

Op/En  Operand 1  Operand 2  Operand 3  Operand 4
NP     NA         NA         NA         NA


Description 

Reads the current value of the processor's time-stamp counter (a 64-bit MSR) into the EDX:EAX registers and also 
reads the value of the IA32_TSC_AUX MSR (address C0000103H) into the ECX register. The EDX register is loaded 
with the high-order 32 bits of the IA32_TSC MSR; the EAX register is loaded with the low-order 32 bits of the 
IA32_TSC MSR; and the ECX register is loaded with the low-order 32-bits of IA32_TSC_AUX MSR. On processors 
that support the Intel 64 architecture, the high-order 32 bits of each of RAX, RDX, and RCX are cleared. 
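
The register contents described above can be modeled from the two MSR values. The sketch below is illustrative only: `struct rdtscp_regs` and `rdtscp_model` are hypothetical names, and the MSR values are passed in as ordinary integers rather than read from hardware.

```c
#include <stdint.h>

/* Register image left by RDTSCP on a processor that supports the
   Intel 64 architecture (hypothetical struct, for illustration). */
struct rdtscp_regs {
    uint64_t rax;
    uint64_t rdx;
    uint64_t rcx;
};

static struct rdtscp_regs rdtscp_model(uint64_t ia32_tsc, uint64_t ia32_tsc_aux)
{
    struct rdtscp_regs r;
    r.rax = (uint32_t)ia32_tsc;          /* low-order 32 bits; upper half of RAX cleared  */
    r.rdx = (uint32_t)(ia32_tsc >> 32);  /* high-order 32 bits; upper half of RDX cleared */
    r.rcx = (uint32_t)ia32_tsc_aux;      /* low-order 32 bits of IA32_TSC_AUX             */
    return r;
}
```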

The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever 
the processor is reset. See "Time Stamp Counter" in Chapter 17 of the Intel® 64 and IA-32 Architectures Software 
Developer's Manual, Volume 3B, for specific details of the time stamp counter behavior. 

The time stamp disable (TSD) flag in register CR4 restricts the use of the RDTSCP instruction as follows. When the 
flag is clear, the RDTSCP instruction can be executed at any privilege level; when the flag is set, the instruction can 
only be executed at privilege level 0. 

The RDTSCP instruction waits until all previous instructions have been executed before reading the counter. 
However, subsequent instructions may begin execution before the read operation is performed. 

See "Changes to Instruction Behavior in VMX Non-Root Operation" in Chapter 25 of the Intel® 64 and IA-32
Architectures Software Developer's Manual, Volume 3C, for more information about the behavior of this instruction in
VMX non-root operation.

Operation 

IF (CR4.TSD = 0) or (CPL = 0) or (CR0.PE = 0)
    THEN
        EDX:EAX <- TimeStampCounter;
        ECX <- IA32_TSC_AUX[31:0];
    ELSE (* CR4.TSD = 1 and (CPL = 1, 2, or 3) and CR0.PE = 1 *)
        #GP(0);
FI;

Flags Affected 

None. 

Protected Mode Exceptions 

#GP(0) If the TSD flag in register CR4 is set and the CPL is greater than 0. 

#UD If the LOCK prefix is used. 

If CPUID.80000001H:EDX.RDTSCP[bit 27] = 0. 

Real-Address Mode Exceptions 

#UD If the LOCK prefix is used. 

If CPUID.80000001H:EDX.RDTSCP[bit 27] = 0. 
Virtual-8086 Mode Exceptions

#GP(0) If the TSD flag in register CR4 is set. 

#UD If the LOCK prefix is used. 

If CPUID.80000001H:EDX.RDTSCP[bit 27] = 0. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions

Same exceptions as in protected mode. 
REP/REPE/REPZ/REPNE/REPNZ—Repeat String Operation Prefix

Opcode        Instruction           Op/En  64-Bit Mode  Compat/Leg Mode  Description
F3 6C         REP INS m8, DX        NP     Valid        Valid            Input (E)CX bytes from port DX into ES:[(E)DI].
F3 6C         REP INS m8, DX        NP     Valid        N.E.             Input RCX bytes from port DX into [RDI].
F3 6D         REP INS m16, DX       NP     Valid        Valid            Input (E)CX words from port DX into ES:[(E)DI].
F3 6D         REP INS m32, DX       NP     Valid        Valid            Input (E)CX doublewords from port DX into ES:[(E)DI].
F3 6D         REP INS r/m32, DX     NP     Valid        N.E.             Input RCX default size from port DX into [RDI].
F3 A4         REP MOVS m8, m8       NP     Valid        Valid            Move (E)CX bytes from DS:[(E)SI] to ES:[(E)DI].
F3 REX.W A4   REP MOVS m8, m8       NP     Valid        N.E.             Move RCX bytes from [RSI] to [RDI].
F3 A5         REP MOVS m16, m16     NP     Valid        Valid            Move (E)CX words from DS:[(E)SI] to ES:[(E)DI].
F3 A5         REP MOVS m32, m32     NP     Valid        Valid            Move (E)CX doublewords from DS:[(E)SI] to ES:[(E)DI].
F3 REX.W A5   REP MOVS m64, m64     NP     Valid        N.E.             Move RCX quadwords from [RSI] to [RDI].
F3 6E         REP OUTS DX, r/m8     NP     Valid        Valid            Output (E)CX bytes from DS:[(E)SI] to port DX.
F3 REX.W 6E   REP OUTS DX, r/m8*    NP     Valid        N.E.             Output RCX bytes from [RSI] to port DX.
F3 6F         REP OUTS DX, r/m16    NP     Valid        Valid            Output (E)CX words from DS:[(E)SI] to port DX.
F3 6F         REP OUTS DX, r/m32    NP     Valid        Valid            Output (E)CX doublewords from DS:[(E)SI] to port DX.
F3 REX.W 6F   REP OUTS DX, r/m32    NP     Valid        N.E.             Output RCX default size from [RSI] to port DX.
F3 AC         REP LODS AL           NP     Valid        Valid            Load (E)CX bytes from DS:[(E)SI] to AL.
F3 REX.W AC   REP LODS AL           NP     Valid        N.E.             Load RCX bytes from [RSI] to AL.
F3 AD         REP LODS AX           NP     Valid        Valid            Load (E)CX words from DS:[(E)SI] to AX.
F3 AD         REP LODS EAX          NP     Valid        Valid            Load (E)CX doublewords from DS:[(E)SI] to EAX.
F3 REX.W AD   REP LODS RAX          NP     Valid        N.E.             Load RCX quadwords from [RSI] to RAX.
F3 AA         REP STOS m8           NP     Valid        Valid            Fill (E)CX bytes at ES:[(E)DI] with AL.
F3 REX.W AA   REP STOS m8           NP     Valid        N.E.             Fill RCX bytes at [RDI] with AL.
F3 AB         REP STOS m16          NP     Valid        Valid            Fill (E)CX words at ES:[(E)DI] with AX.
F3 AB         REP STOS m32          NP     Valid        Valid            Fill (E)CX doublewords at ES:[(E)DI] with EAX.
F3 REX.W AB   REP STOS m64          NP     Valid        N.E.             Fill RCX quadwords at [RDI] with RAX.
F3 A6         REPE CMPS m8, m8      NP     Valid        Valid            Find nonmatching bytes in ES:[(E)DI] and DS:[(E)SI].
F3 REX.W A6   REPE CMPS m8, m8      NP     Valid        N.E.             Find nonmatching bytes in [RDI] and [RSI].
F3 A7         REPE CMPS m16, m16    NP     Valid        Valid            Find nonmatching words in ES:[(E)DI] and DS:[(E)SI].
F3 A7         REPE CMPS m32, m32    NP     Valid        Valid            Find nonmatching doublewords in ES:[(E)DI] and DS:[(E)SI].
F3 REX.W A7   REPE CMPS m64, m64    NP     Valid        N.E.             Find nonmatching quadwords in [RDI] and [RSI].
F3 AE         REPE SCAS m8          NP     Valid        Valid            Find non-AL byte starting at ES:[(E)DI].
F3 REX.W AE   REPE SCAS m8          NP     Valid        N.E.             Find non-AL byte starting at [RDI].
F3 AF         REPE SCAS m16         NP     Valid        Valid            Find non-AX word starting at ES:[(E)DI].
F3 AF         REPE SCAS m32         NP     Valid        Valid            Find non-EAX doubleword starting at ES:[(E)DI].
F3 REX.W AF   REPE SCAS m64         NP     Valid        N.E.             Find non-RAX quadword starting at [RDI].
F2 A6         REPNE CMPS m8, m8     NP     Valid        Valid            Find matching bytes in ES:[(E)DI] and DS:[(E)SI].
F2 REX.W A6   REPNE CMPS m8, m8     NP     Valid        N.E.             Find matching bytes in [RDI] and [RSI].
F2 A7         REPNE CMPS m16, m16   NP     Valid        Valid            Find matching words in ES:[(E)DI] and DS:[(E)SI].
F2 A7         REPNE CMPS m32, m32   NP     Valid        Valid            Find matching doublewords in ES:[(E)DI] and DS:[(E)SI].
F2 REX.W A7   REPNE CMPS m64, m64   NP     Valid        N.E.             Find matching quadwords in [RDI] and [RSI].
F2 AE         REPNE SCAS m8         NP     Valid        Valid            Find AL, starting at ES:[(E)DI].
F2 REX.W AE   REPNE SCAS m8         NP     Valid        N.E.             Find AL, starting at [RDI].
F2 AF         REPNE SCAS m16        NP     Valid        Valid            Find AX, starting at ES:[(E)DI].
F2 AF         REPNE SCAS m32        NP     Valid        Valid            Find EAX, starting at ES:[(E)DI].
F2 REX.W AF   REPNE SCAS m64        NP     Valid        N.E.             Find RAX, starting at [RDI].

NOTES: 

* In 64-bit mode, r/m8 can not be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH. 


Instruction Operand Encoding

Op/En  Operand 1  Operand 2  Operand 3  Operand 4
NP     NA         NA         NA         NA


Description 

Repeats a string instruction the number of times specified in the count register or until the indicated condition of 
the ZF flag is no longer met. The REP (repeat), REPE (repeat while equal), REPNE (repeat while not equal), REPZ 
(repeat while zero), and REPNZ (repeat while not zero) mnemonics are prefixes that can be added to one of the 
string instructions. The REP prefix can be added to the INS, OUTS, MOVS, LODS, and STOS instructions, and the 
REPE, REPNE, REPZ, and REPNZ prefixes can be added to the CMPS and SCAS instructions. (The REPZ and REPNZ 
prefixes are synonymous forms of the REPE and REPNE prefixes, respectively.) The F3H prefix is defined for the 
following instructions and undefined for the rest: 

• F3H as REP/REPE/REPZ for string and input/output instruction. 

• F3H is a mandatory prefix for POPCNT, LZCNT, and ADOX. 

The REP prefixes apply only to one string instruction at a time. To repeat a block of instructions, use the LOOP
instruction or another looping construct. All of these repeat prefixes cause the associated instruction to be repeated
until the count in the count register is decremented to 0. See Table 4-17.


Table 4-17. Repeat Prefixes

Repeat Prefix   Termination Condition 1*   Termination Condition 2
REP             RCX or (E)CX = 0           None
REPE/REPZ       RCX or (E)CX = 0           ZF = 0
REPNE/REPNZ     RCX or (E)CX = 0           ZF = 1


NOTES: 

* Count register is CX, ECX or RCX by default, depending on attributes of the operating modes. 
The REPE, REPNE, REPZ, and REPNZ prefixes also check the state of the ZF flag after each iteration and terminate
the repeat loop if the ZF flag is not in the specified state. When both termination conditions are tested, the cause 
of a repeat termination can be determined either by testing the count register with a JECXZ instruction or by 
testing the ZF flag (with a JZ, JNZ, or JNE instruction). 

When the REPE/REPZ and REPNE/REPNZ prefixes are used, the ZF flag does not require initialization because both 
the CMPS and SCAS instructions affect the ZF flag according to the results of the comparisons they make. 

A repeating string operation can be suspended by an exception or interrupt. When this happens, the state of the 
registers is preserved to allow the string operation to be resumed upon a return from the exception or interrupt 
handler. The source and destination registers point to the next string elements to be operated on, the EIP register 
points to the string instruction, and the ECX register has the value it held following the last successful iteration of 
the instruction. This mechanism allows long string operations to proceed without affecting the interrupt response 
time of the system. 

When a fault occurs during the execution of a CMPS or SCAS instruction that is prefixed with REPE or REPNE, the 
EFLAGS value is restored to the state prior to the execution of the instruction. Since the SCAS and CMPS
instructions do not use EFLAGS as an input, the processor can resume the instruction after the page fault handler.

Use the REP INS and REP OUTS instructions with caution. Not all I/O ports can handle the rate at which these 
instructions execute. Note that a REP STOS instruction is the fastest way to initialize a large block of memory. 

In 64-bit mode, the operand size of the count register is associated with the address size attribute. Thus the default 
count register is RCX; REX.W has no effect on the address size and the count register. In 64-bit mode, if 67H is
used to override address size attribute, the count register is ECX and any implicit source/destination operand will 
use the corresponding 32-bit index register. See the summary chart at the beginning of this section for encoding 
data and limits. 

REP INS may read from the I/O port without writing to the memory location if an exception or VM exit occurs due 
to the write (e.g. #PF). If this would be problematic, for example because the I/O port read has side-effects,
software should ensure the write to the memory location does not cause an exception or VM exit.

Operation 

IF AddressSize = 16
    THEN
        Use CX for CountReg;
        Implicit Source/Dest operand for memory use of SI/DI;
    ELSE IF AddressSize = 64
        THEN Use RCX for CountReg;
        Implicit Source/Dest operand for memory use of RSI/RDI;
    ELSE
        Use ECX for CountReg;
        Implicit Source/Dest operand for memory use of ESI/EDI;
FI;
WHILE CountReg ≠ 0
    DO
        Service pending interrupts (if any);
        Execute associated string instruction;
        CountReg <- (CountReg - 1);
        IF CountReg = 0
            THEN exit WHILE loop; FI;
        IF (Repeat prefix is REPZ or REPE) and (ZF = 0)
        or (Repeat prefix is REPNZ or REPNE) and (ZF = 1)
            THEN exit WHILE loop; FI;
    OD;
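
The WHILE loop above can be illustrated with a small C model of REPE/REPNE applied to a byte scan (SCASB). This is a sketch under stated assumptions, not the processor's implementation: `struct scan_state`, `enum rep_kind`, and `rep_scasb` are hypothetical names, the direction flag is assumed to be 0 (addresses increment), and interrupt servicing is omitted.

```c
#include <stddef.h>
#include <stdint.h>

enum rep_kind { REP_E, REP_NE };   /* REPE/REPZ or REPNE/REPNZ */

struct scan_state {
    const uint8_t *di;   /* models (E)DI / RDI          */
    size_t count;        /* models the count register   */
    int zf;              /* models the ZF flag          */
};

/* Models REPE/REPNE SCASB with DF = 0. */
static void rep_scasb(enum rep_kind kind, uint8_t al, struct scan_state *s)
{
    while (s->count != 0) {
        s->zf = (al == *s->di);   /* SCASB: compare AL with the byte at DI */
        s->di++;                  /* advance to the next string element    */
        s->count--;               /* CountReg <- CountReg - 1              */
        if (s->count == 0)
            break;                /* termination condition 1 */
        if ((kind == REP_E && s->zf == 0) || (kind == REP_NE && s->zf == 1))
            break;                /* termination condition 2 */
    }
}
```

In this model, a match on the final element leaves the count at 0 with ZF = 1, so a caller that needs to know whether a match occurred has to test ZF and not only the count register.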

Flags Affected 

None; however, the CMPS and SCAS instructions do set the status flags in the EFLAGS register. 
Exceptions (All Operating Modes)

Exceptions may be generated by an instruction associated with the prefix. 

64-Bit Mode Exceptions

#GP(0) If the memory address is in a non-canonical form. 
RET—Return from Procedure

Opcode*       Instruction           Op/En  64-Bit Mode  Compat/Leg Mode  Description
C3            RET                   NP     Valid        Valid            Near return to calling procedure.
CB            RET                   NP     Valid        Valid            Far return to calling procedure.
C2 iw         RET imm16             I      Valid        Valid            Near return to calling procedure and pop imm16 bytes from stack.
CA iw         RET imm16             I      Valid        Valid            Far return to calling procedure and pop imm16 bytes from stack.


Instruction Operand Encoding

Op/En  Operand 1  Operand 2  Operand 3  Operand 4
NP     NA         NA         NA         NA
I      imm16      NA         NA         NA


Description 

Transfers program control to a return address located on the top of the stack. The address is usually placed on the 
stack by a CALL instruction, and the return is made to the instruction that follows the CALL instruction. 

The optional source operand specifies the number of stack bytes to be released after the return address is popped; 
the default is none. This operand can be used to release parameters from the stack that were passed to the called 
procedure and are no longer needed. It must be used when the CALL instruction used to switch to a new procedure 
uses a call gate with a non-zero word count to access the new procedure. Here, the source operand for the RET 
instruction must specify the same number of bytes as is specified in the word count field of the call gate. 

The RET instruction can be used to execute three different types of returns: 

• Near return — A return to a calling procedure within the current code segment (the segment currently pointed 
to by the CS register), sometimes referred to as an intrasegment return. 

• Far return — A return to a calling procedure located in a different segment than the current code segment, 
sometimes referred to as an intersegment return. 

• Inter-privilege-level far return — A far return to a different privilege level than that of the currently
executing program or procedure. 

The inter-privilege-level return type can only be executed in protected mode. See the section titled "Calling
Procedures Using Call and RET" in Chapter 6 of the Intel® 64 and IA-32 Architectures Software Developer's Manual,
Volume 1, for detailed information on near, far, and inter-privilege-level returns. 

When executing a near return, the processor pops the return instruction pointer (offset) from the top of the stack 
into the EIP register and begins program execution at the new instruction pointer. The CS register is unchanged. 
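
A near return with the optional source operand can be sketched as a pop followed by a stack-pointer adjustment. The model below is illustrative only: `struct cpu32`, `pop32`, and `ret_near32` are hypothetical names, the stack is a small byte array indexed by ESP, operand size and stack-address size are both assumed to be 32 bits, and the stack-limit checks that raise #SS(0) are not modeled.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical flat model of a 32-bit stack: mem stands in for the stack
   segment and esp is a byte offset into it. */
struct cpu32 {
    uint8_t  mem[64];
    uint32_t esp;
    uint32_t eip;
};

static uint32_t pop32(struct cpu32 *c)
{
    uint32_t v;
    memcpy(&v, &c->mem[c->esp], sizeof v);  /* read 4 bytes at the top of stack */
    c->esp += 4;
    return v;
}

/* Near return; imm16 is the optional source operand (0 for a plain RET). */
static void ret_near32(struct cpu32 *c, uint16_t imm16)
{
    c->eip = pop32(c);   /* EIP <- Pop()     */
    c->esp += imm16;     /* ESP <- ESP + SRC */
}
```

With a return address at offset 16 and 8 bytes of parameters above it, `ret_near32(&c, 8)` leaves ESP at 28: 4 bytes for the return address plus the 8 bytes released by the operand.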

When executing a far return, the processor pops the return instruction pointer from the top of the stack into the EIP 
register, then pops the segment selector from the top of the stack into the CS register. The processor then begins 
program execution in the new code segment at the new instruction pointer. 
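
The far-return sequence differs from the near return in the second pop. The sketch below models a same-privilege far return with a 32-bit operand size; `struct far_cpu`, `far_pop32`, and `ret_far32` are hypothetical names, and no limit, descriptor, or privilege checks are modeled.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model: mem stands in for the stack segment, esp is a byte
   offset into it, and cs holds only the 16-bit selector. */
struct far_cpu {
    uint8_t  mem[64];
    uint32_t esp;
    uint32_t eip;
    uint16_t cs;
};

static uint32_t far_pop32(struct far_cpu *c)
{
    uint32_t v;
    memcpy(&v, &c->mem[c->esp], sizeof v);
    c->esp += 4;
    return v;
}

/* Far return, 32-bit operand size; imm16 is the optional source operand. */
static void ret_far32(struct far_cpu *c, uint16_t imm16)
{
    c->eip = far_pop32(c);            /* EIP <- Pop()                                */
    c->cs  = (uint16_t)far_pop32(c);  /* 32-bit pop, high-order 16 bits discarded    */
    c->esp += imm16;                  /* release parameters, if any                  */
}
```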

The mechanics of an inter-privilege-level far return are similar to an intersegment return, except that the 
processor examines the privilege levels and access rights of the code and stack segments being returned to
determine if the control transfer is allowed to be made. The DS, ES, FS, and GS segment registers are cleared by the RET
instruction during an inter-privilege-level return if they refer to segments that are not allowed to be accessed at the 
new privilege level. Since a stack switch also occurs on an inter-privilege level return, the ESP and SS registers are 
loaded from the stack. 
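
The privilege examination and the clearing of data-segment registers can be sketched as two small functions. This is a simplified model under stated assumptions: `classify_far_return`, `clear_inaccessible`, and `struct seg_reg` are hypothetical names, only the RPL/CPL/DPL comparisons from the Operation section are modeled, and the descriptor-table, presence, and stack-segment checks are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum far_ret_kind { RET_FAULT_GP, RET_SAME_LEVEL, RET_OUTER_LEVEL };

/* cpl: current privilege level; rpl: RPL of the return code segment
   selector; dpl: DPL of the return code segment descriptor. */
static enum far_ret_kind classify_far_return(unsigned cpl, unsigned rpl,
                                             unsigned dpl, bool conforming)
{
    if (rpl < cpl)
        return RET_FAULT_GP;              /* cannot return to a more privileged level */
    if (conforming && dpl > rpl)
        return RET_FAULT_GP;
    if (!conforming && dpl != rpl)
        return RET_FAULT_GP;
    return (rpl > cpl) ? RET_OUTER_LEVEL : RET_SAME_LEVEL;
}

/* Simplified view of a data-segment register and its hidden descriptor part. */
struct seg_reg {
    uint16_t selector;
    unsigned dpl;
    bool data_or_nonconforming_code;
};

/* After a return to an outer level, clear every register that refers to a
   segment not accessible at the new privilege level. */
static void clear_inaccessible(struct seg_reg *regs, size_t n, unsigned new_cpl)
{
    for (size_t i = 0; i < n; i++) {
        if (regs[i].data_or_nonconforming_code && new_cpl > regs[i].dpl)
            regs[i].selector = 0;         /* segment selector becomes invalid */
    }
}
```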

If parameters are passed to the called procedure during an inter-privilege level call, the optional source operand 
must be used with the RET instruction to release the parameters on the return. Here, the parameters are released 
both from the called procedure's stack and the calling procedure's stack (that is, the stack being returned to). 

In 64-bit mode, the default operation size of this instruction is the stack-address size, i.e. 64 bits. This applies to 
near returns, not far returns; the default operation size of far returns is 32 bits. 
Operation

(* Near return *)
IF instruction = near return
    THEN;
        IF OperandSize = 32
            THEN
                IF top 4 bytes of stack not within stack limits
                    THEN #SS(0); FI;
                EIP <- Pop();
            ELSE
                IF OperandSize = 64
                    THEN
                        IF top 8 bytes of stack not within stack limits
                            THEN #SS(0); FI;
                        RIP <- Pop();
                    ELSE (* OperandSize = 16 *)
                        IF top 2 bytes of stack not within stack limits
                            THEN #SS(0); FI;
                        tempEIP <- Pop();
                        tempEIP <- tempEIP AND 0000FFFFH;
                        IF tempEIP not within code segment limits
                            THEN #GP(0); FI;
                        EIP <- tempEIP;
                FI;
        FI;
        IF instruction has immediate operand
            THEN (* Release parameters from stack *)
                IF StackAddressSize = 32
                    THEN
                        ESP <- ESP + SRC;
                    ELSE
                        IF StackAddressSize = 64
                            THEN
                                RSP <- RSP + SRC;
                            ELSE (* StackAddressSize = 16 *)
                                SP <- SP + SRC;
                        FI;
                FI;
        FI;
FI;


(* Real-address mode or virtual-8086 mode *)
IF ((PE = 0) or (PE = 1 AND VM = 1)) and instruction = far return
    THEN
        IF OperandSize = 32
            THEN
                IF top 8 bytes of stack not within stack limits
                    THEN #SS(0); FI;
                EIP <- Pop();
                CS <- Pop(); (* 32-bit pop, high-order 16 bits discarded *)
            ELSE (* OperandSize = 16 *)
                IF top 4 bytes of stack not within stack limits
                    THEN #SS(0); FI;
                tempEIP <- Pop();
                tempEIP <- tempEIP AND 0000FFFFH;
                IF tempEIP not within code segment limits
                    THEN #GP(0); FI;
                EIP <- tempEIP;
                CS <- Pop(); (* 16-bit pop *)
        FI;
        IF instruction has immediate operand
            THEN (* Release parameters from stack *)
                SP <- SP + (SRC AND FFFFH);
        FI;
FI;


(* Protected mode, not virtual-8086 mode *)
IF (PE = 1 and VM = 0 and IA32_EFER.LMA = 0) and instruction = far return
    THEN
        IF OperandSize = 32
            THEN
                IF second doubleword on stack is not within stack limits
                    THEN #SS(0); FI;
            ELSE (* OperandSize = 16 *)
                IF second word on stack is not within stack limits
                    THEN #SS(0); FI;
        FI;
        IF return code segment selector is NULL
            THEN #GP(0); FI;
        IF return code segment selector addresses descriptor beyond descriptor table limit
            THEN #GP(selector); FI;
        Obtain descriptor to which return code segment selector points from descriptor table;
        IF return code segment descriptor is not a code segment
            THEN #GP(selector); FI;
        IF return code segment selector RPL < CPL
            THEN #GP(selector); FI;
        IF return code segment descriptor is conforming
        and return code segment DPL > return code segment selector RPL
            THEN #GP(selector); FI;
        IF return code segment descriptor is non-conforming
        and return code segment DPL ≠ return code segment selector RPL
            THEN #GP(selector); FI;
        IF return code segment descriptor is not present
            THEN #NP(selector); FI;
        IF return code segment selector RPL > CPL
            THEN GOTO RETURN-TO-OUTER-PRIVILEGE-LEVEL;
            ELSE GOTO RETURN-TO-SAME-PRIVILEGE-LEVEL;
        FI;
FI;


RETURN-TO-SAME-PRIVILEGE-LEVEL:
    IF the return instruction pointer is not within the return code segment limit
        THEN #GP(0); FI;
    IF OperandSize = 32
        THEN
            EIP <- Pop();
            CS <- Pop(); (* 32-bit pop, high-order 16 bits discarded *)
        ELSE (* OperandSize = 16 *)
            EIP <- Pop();
            EIP <- EIP AND 0000FFFFH;
            CS <- Pop(); (* 16-bit pop *)
    FI;
    IF instruction has immediate operand
        THEN (* Release parameters from stack *)
            IF StackAddressSize = 32
                THEN
                    ESP <- ESP + SRC;
                ELSE (* StackAddressSize = 16 *)
                    SP <- SP + SRC;
            FI;
    FI;


RETURN-TO-OUTER-PRIVILEGE-LEVEL:
    IF top (16 + SRC) bytes of stack are not within stack limits (OperandSize = 32)
    or top (8 + SRC) bytes of stack are not within stack limits (OperandSize = 16)
        THEN #SS(0); FI;
    Read return segment selector;
    IF stack segment selector is NULL
        THEN #GP(0); FI;
    IF return stack segment selector index is not within its descriptor table limits
        THEN #GP(selector); FI;
    Read segment descriptor pointed to by return segment selector;
    IF stack segment selector RPL ≠ RPL of the return code segment selector
    or stack segment is not a writable data segment
    or stack segment descriptor DPL ≠ RPL of the return code segment selector
        THEN #GP(selector); FI;
    IF stack segment not present
        THEN #SS(StackSegmentSelector); FI;
    IF the return instruction pointer is not within the return code segment limit
        THEN #GP(0); FI;
    CPL <- ReturnCodeSegmentSelector(RPL);
    IF OperandSize = 32
        THEN
            EIP <- Pop();
            CS <- Pop(); (* 32-bit pop, high-order 16 bits discarded; segment descriptor loaded *)
            CS(RPL) <- CPL;
            IF instruction has immediate operand
                THEN (* Release parameters from called procedure's stack *)
                    IF StackAddressSize = 32
                        THEN
                            ESP <- ESP + SRC;
                        ELSE (* StackAddressSize = 16 *)
                            SP <- SP + SRC;
                    FI;
            FI;
            tempESP <- Pop();
            tempSS <- Pop(); (* 32-bit pop, high-order 16 bits discarded; seg. descriptor loaded *)
            ESP <- tempESP;
            SS <- tempSS;
        ELSE (* OperandSize = 16 *)
            EIP <- Pop();
            EIP <- EIP AND 0000FFFFH;
            CS <- Pop(); (* 16-bit pop; segment descriptor loaded *)
            CS(RPL) <- CPL;
            IF instruction has immediate operand
                THEN (* Release parameters from called procedure's stack *)
                    IF StackAddressSize = 32
                        THEN
                            ESP <- ESP + SRC;
                        ELSE (* StackAddressSize = 16 *)
                            SP <- SP + SRC;
                    FI;
            FI;
            tempESP <- Pop();
            tempSS <- Pop(); (* 16-bit pop; segment descriptor loaded *)
            ESP <- tempESP;
            SS <- tempSS;
    FI;
    FOR each of segment register (ES, FS, GS, and DS)
        DO
            IF segment register points to data or non-conforming code segment
            and CPL > segment descriptor DPL (* DPL in hidden part of segment register *)
                THEN SegmentSelector <- 0; (* Segment selector invalid *)
            FI;
        OD;
    IF instruction has immediate operand
        THEN (* Release parameters from calling procedure's stack *)
            IF StackAddressSize = 32
                THEN
                    ESP <- ESP + SRC;
                ELSE (* StackAddressSize = 16 *)
                    SP <- SP + SRC;
            FI;
    FI;

(* IA-32e Mode *)
IF (PE = 1 and VM = 0 and IA32_EFER.LMA = 1) and instruction = far return
    THEN
        IF OperandSize = 32
            THEN
                IF second doubleword on stack is not within stack limits
                    THEN #SS(0); FI;
                IF first or second doubleword on stack is not in canonical space
                    THEN #SS(0); FI;
            ELSE
                IF OperandSize = 16
                    THEN
                        IF second word on stack is not within stack limits
                            THEN #SS(0); FI;
                        IF first or second word on stack is not in canonical space
                            THEN #SS(0); FI;
                    ELSE (* OperandSize = 64 *)
                        IF first or second quadword on stack is not in canonical space
                            THEN #SS(0); FI;
                FI;
        FI;
        IF return code segment selector is NULL
            THEN #GP(0); FI;
        IF return code segment selector addresses descriptor beyond descriptor table limit
            THEN #GP(selector); FI;
        IF return code segment selector addresses descriptor in non-canonical space
            THEN #GP(selector); FI;
        Obtain descriptor to which return code segment selector points from descriptor table;
        IF return code segment descriptor is not a code segment
            THEN #GP(selector); FI;
        IF return code segment descriptor has L-bit = 1 and D-bit = 1
            THEN #GP(selector); FI;
        IF return code segment selector RPL < CPL
            THEN #GP(selector); FI;
        IF return code segment descriptor is conforming
        and return code segment DPL > return code segment selector RPL
            THEN #GP(selector); FI;
        IF return code segment descriptor is non-conforming
        and return code segment DPL ≠ return code segment selector RPL
            THEN #GP(selector); FI;
        IF return code segment descriptor is not present
            THEN #NP(selector); FI;
        IF return code segment selector RPL > CPL
            THEN GOTO IA-32E-MODE-RETURN-TO-OUTER-PRIVILEGE-LEVEL;
            ELSE GOTO IA-32E-MODE-RETURN-SAME-PRIVILEGE-LEVEL;
        FI;
FI;


IA-32E-MODE-RETURN-SAME-PRIVILEGE-LEVEL:
    IF the return instruction pointer is not within the return code segment limit
        THEN #GP(0); FI;
    IF the return instruction pointer is not within canonical address space
        THEN #GP(0); FI;
    IF OperandSize = 32
        THEN
            EIP <- Pop();
            CS <- Pop(); (* 32-bit pop, high-order 16 bits discarded *)
        ELSE
            IF OperandSize = 16
                THEN
                    EIP <- Pop();
                    EIP <- EIP AND 0000FFFFH;
                    CS <- Pop(); (* 16-bit pop *)
                ELSE (* OperandSize = 64 *)
                    RIP <- Pop();
                    CS <- Pop(); (* 64-bit pop, high-order 48 bits discarded *)
            FI;
    FI;
    IF instruction has immediate operand
        THEN (* Release parameters from stack *)
            IF StackAddressSize = 32
                THEN
                    ESP <- ESP + SRC;
                ELSE
                    IF StackAddressSize = 16
                        THEN
                            SP <- SP + SRC;
                        ELSE (* StackAddressSize = 64 *)
                            RSP <- RSP + SRC;
                    FI;
            FI;
    FI;

IA-32E-MODE-RETURN-TO-OUTER-PRIVILEGE-LEVEL: 

IF top (16 + SRC) bytes of stack are not within stack limits (OperandSIze = 32) 
or top (8 + SRC) bytes of stack are not within stack limits (OperandSIze = 16) 

THEN #SS(0); FI; 

IF top (16 + SRC) bytes of stack are not In canonical address space (OperandSIze = 32) 
or top (8 + SRC) bytes of stack are not In canonical address space (OperandSIze = 16) 
or top (32 + SRC) bytes of stack are not In canonical address space (OperandSIze = 64) 

THEN #SS(0); FI; 

Read return stack segment selector; 

IF stack segment selector is NULL 
THEN 

IF new CS descriptor L-bIt = 0 
THEN #GP(selector); 

IF stack segment selector RPL = 3 
THEN #GP(selector); 

FI; 

IF return stack segment descriptor is not within descriptor table limits 
THEN #GP(selector); FI; 

IF return stack segment descriptor is in non-canonical address space 
THEN #GP(selector); FI; 

Read segment descriptor pointed to by return segment selector; 

IF stack segment selector RPL RPL of the return code segment selector 
or stack segment Is not a writable data segment 

or stack segment descriptor DPL RPL of the return code segment selector 
THEN #GP(selector); FI; 

IF stack segment not present 

THEN #SS(StackSegmentSelector); FI; 

IF the return instruction pointer is not within the return code segment limit 
THEN #GP(0); FI: 

IF the return instruction pointer is not within canonical address space 
THEN #GP(0); FI; 

CPL ReturnCodeSegmentSelector(RPL); 

IF OperandSize = 32
    THEN
        EIP ← Pop();
        CS ← Pop(); (* 32-bit pop, high-order 16 bits discarded, segment descriptor loaded *)
        CS(RPL) ← CPL;
        IF instruction has immediate operand
            THEN (* Release parameters from called procedure's stack *)
                IF StackAddressSize = 32
                    THEN
                        ESP ← ESP + SRC;
                    ELSE
                        IF StackAddressSize = 16
                            THEN
                                SP ← SP + SRC;
                            ELSE (* StackAddressSize = 64 *)
                                RSP ← RSP + SRC;
                        FI;
                FI;
        FI;
        tempESP ← Pop();
        tempSS ← Pop(); (* 32-bit pop, high-order 16 bits discarded, segment descriptor loaded *)
        ESP ← tempESP;
        SS ← tempSS;
    ELSE
        IF OperandSize = 16
            THEN
                EIP ← Pop();
                EIP ← EIP AND 0000FFFFH;
                CS ← Pop(); (* 16-bit pop; segment descriptor loaded *)
                CS(RPL) ← CPL;
                IF instruction has immediate operand
                    THEN (* Release parameters from called procedure's stack *)
                        IF StackAddressSize = 32
                            THEN
                                ESP ← ESP + SRC;
                            ELSE
                                IF StackAddressSize = 16
                                    THEN
                                        SP ← SP + SRC;
                                    ELSE (* StackAddressSize = 64 *)
                                        RSP ← RSP + SRC;
                                FI;
                        FI;
                FI;
                tempESP ← Pop();
                tempSS ← Pop(); (* 16-bit pop; segment descriptor loaded *)
                ESP ← tempESP;
                SS ← tempSS;
            ELSE (* OperandSize = 64 *)
                RIP ← Pop();
                CS ← Pop(); (* 64-bit pop; high-order 48 bits discarded; seg. descriptor loaded *)
                CS(RPL) ← CPL;
                IF instruction has immediate operand
                    THEN (* Release parameters from called procedure's stack *)
                        RSP ← RSP + SRC;
                FI;
                tempESP ← Pop();
                tempSS ← Pop(); (* 64-bit pop; high-order 48 bits discarded; seg. desc. loaded *)
                ESP ← tempESP;
                SS ← tempSS;
        FI;
FI;

FOR each of segment register (ES, FS, GS, and DS)
    DO
        IF segment register points to data or non-conforming code segment
        and CPL > segment descriptor DPL; (* DPL in hidden part of segment register *)
            THEN SegmentSelector ← 0; (* SegmentSelector invalid *)
        FI;
    OD;

IF instruction has immediate operand
    THEN (* Release parameters from calling procedure's stack *)
        IF StackAddressSize = 32
            THEN
                ESP ← ESP + SRC;
            ELSE
                IF StackAddressSize = 16
                    THEN
                        SP ← SP + SRC;
                    ELSE (* StackAddressSize = 64 *)
                        RSP ← RSP + SRC;
                FI;
        FI;
FI;
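
The segment-register check in the FOR loop above can be sketched in C. This is a minimal model, not the processor's implementation: the `seg_reg` structure, its field names, and the function name `invalidate_after_return` are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one data-segment register (ES, FS, GS or DS). */
typedef struct {
    uint16_t selector;      /* visible selector value */
    uint8_t  dpl;           /* DPL cached in the hidden part of the register */
    bool     is_code;       /* true if the descriptor is a code segment */
    bool     is_conforming; /* meaningful only when is_code is true */
} seg_reg;

/* After a return to an outer privilege level, clear every selector that
   refers to a data or non-conforming code segment more privileged than
   the new CPL (CPL > DPL). */
static void invalidate_after_return(seg_reg regs[4], unsigned cpl)
{
    for (int i = 0; i < 4; i++) {
        bool data_or_nonconforming = !regs[i].is_code || !regs[i].is_conforming;
        if (data_or_nonconforming && cpl > regs[i].dpl)
            regs[i].selector = 0; /* selector becomes invalid (NULL) */
    }
}
```

Conforming code segments are left untouched because the condition in the pseudocode only names data and non-conforming code segments.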

Flags Affected 

None. 


Protected Mode Exceptions

#GP(0)            If the return code or stack segment selector is NULL.
                  If the return instruction pointer is not within the return code segment limit.
#GP(selector)     If the RPL of the return code segment selector is less than the CPL.
                  If the return code or stack segment selector index is not within its descriptor table limits.
                  If the return code segment descriptor does not indicate a code segment.
                  If the return code segment is non-conforming and the segment selector's DPL is not equal to
                  the RPL of the code segment's segment selector.
                  If the return code segment is conforming and the segment selector's DPL is greater than the RPL
                  of the code segment's segment selector.
                  If the stack segment is not a writable data segment.
                  If the stack segment selector RPL is not equal to the RPL of the return code segment selector.
                  If the stack segment descriptor DPL is not equal to the RPL of the return code segment
                  selector.
#SS(0)            If the top bytes of stack are not within stack limits.
                  If the return stack segment is not present.
#NP(selector)     If the return code segment is not present.
#PF(fault-code)   If a page fault occurs.
#AC(0)            If an unaligned memory access occurs when the CPL is 3 and alignment checking is enabled.


Real-Address Mode Exceptions 

#GP If the return instruction pointer is not within the return code segment limit 

#SS If the top bytes of stack are not within stack limits. 


Virtual-8086 Mode Exceptions

#GP(0) If the return instruction pointer is not within the return code segment limit.

#SS(0) If the top bytes of stack are not within stack limits.

#PF(fault-code) If a page fault occurs. 

#AC(0) If an unaligned memory access occurs when alignment checking is enabled. 

Compatibility Mode Exceptions 

Same as 64-bit mode exceptions. 


64-Bit Mode Exceptions 

#GP(0) If the return instruction pointer is non-canonical. 

If the return instruction pointer is not within the return code segment limit. 

If the stack segment selector is NULL going back to compatibility mode. 

If the stack segment selector is NULL going back to CPL3 64-bit mode. 

If a NULL stack segment selector RPL is not equal to CPL going back to non-CPL3 64-bit mode. 
If the return code segment selector is NULL. 

#GP(selector) If the proposed segment descriptor for a code segment does not indicate it is a code segment. 
If the proposed new code segment descriptor has both the D-bit and L-bit set. 

If the DPL for a nonconforming-code segment is not equal to the RPL of the code segment 
selector. 

If CPL is greater than the RPL of the code segment selector. 

If the DPL of a conforming-code segment is greater than the return code segment selector 
RPL. 


If a segment selector index is outside its descriptor table limits.

If a segment descriptor memory address is non-canonical.

If the stack segment is not a writable data segment.

If the stack segment descriptor DPL is not equal to the RPL of the return code segment
selector.

If the stack segment selector RPL is not equal to the RPL of the return code segment selector.

#SS(0) If an attempt to pop a value off the stack violates the SS limit.

If an attempt to pop a value off the stack causes a non-canonical address to be referenced.

#NP(selector) If the return code or stack segment is not present.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
current privilege level is 3.
RORX — Rotate Right Logical Without Affecting Flags


Opcode/Instruction            Op/En   64/32-bit Mode   CPUID Feature Flag   Description
VEX.LZ.F2.0F3A.W0 F0 /r ib    RMI     V/V              BMI2                 Rotate 32-bit r/m32 right imm8 times without affecting arithmetic flags.
RORX r32, r/m32, imm8
VEX.LZ.F2.0F3A.W1 F0 /r ib    RMI     V/N.E.           BMI2                 Rotate 64-bit r/m64 right imm8 times without affecting arithmetic flags.
RORX r64, r/m64, imm8


Instruction Operand Encoding 


Op/En   Operand 1       Operand 2       Operand 3   Operand 4
RMI     ModRM:reg (w)   ModRM:r/m (r)   imm8        NA


Description 

Rotates the bits of the second operand right by the count value specified in imm8 without affecting arithmetic flags.
The RORX instruction does not read or write the arithmetic flags.

This instruction is not supported in real mode and virtual-8086 mode. The operand size is always 32 bits if not in
64-bit mode. In 64-bit mode operand size 64 requires VEX.W1. VEX.W1 is ignored in non-64-bit modes. An
attempt to execute this instruction with VEX.L not equal to 0 will cause #UD.
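
The rotate itself can be modeled in portable C. The sketch below only reproduces the result value; the function names `rorx32` and `rorx64` are illustrative and not part of any intrinsic API, and the flag-preserving property has no C equivalent.

```c
#include <stdint.h>

/* Model of RORX r32, r/m32, imm8: the count is masked to 5 bits. */
static uint32_t rorx32(uint32_t src, uint8_t imm8)
{
    unsigned y = imm8 & 0x1Fu;
    /* (32 - y) & 31 avoids an undefined shift by 32 when y is 0. */
    return (src >> y) | (src << ((32u - y) & 31u));
}

/* Model of RORX r64, r/m64, imm8: the count is masked to 6 bits. */
static uint64_t rorx64(uint64_t src, uint8_t imm8)
{
    unsigned y = imm8 & 0x3Fu;
    return (src >> y) | (src << ((64u - y) & 63u));
}
```

For example, `rorx32(0x12345678u, 8)` moves the low byte to the top and yields `0x78123456`.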

Operation 

IF (OperandSize = 32)
    y ← imm8 AND 1FH;
    DEST ← (SRC >> y) | (SRC << (32-y));
ELSEIF (OperandSize = 64)
    y ← imm8 AND 3FH;
    DEST ← (SRC >> y) | (SRC << (64-y));
ENDIF

Flags Affected

None

Intel C/C++ Compiler Intrinsic Equivalent

Auto-generated from high-level language.

SIMD Floating-Point Exceptions

None

Other Exceptions

See Section 2.5.1, "Exception Conditions for VEX-Encoded GPR Instructions", Table 2-29; additionally
#UD If VEX.W = 1.
ROUNDPD — Round Packed Double Precision Floating-Point Values


Opcode*/Instruction               Op/En   64/32 bit Mode Support   CPUID Feature Flag   Description
66 0F 3A 09 /r ib                 RMI     V/V                      SSE4_1               Round packed double precision floating-point values in xmm2/m128 and place the result in xmm1. The rounding mode is determined by imm8.
ROUNDPD xmm1, xmm2/m128, imm8
VEX.128.66.0F3A.WIG 09 /r ib      RMI     V/V                      AVX                  Round packed double-precision floating-point values in xmm2/m128 and place the result in xmm1. The rounding mode is determined by imm8.
VROUNDPD xmm1, xmm2/m128, imm8
VEX.256.66.0F3A.WIG 09 /r ib      RMI     V/V                      AVX                  Round packed double-precision floating-point values in ymm2/m256 and place the result in ymm1. The rounding mode is determined by imm8.
VROUNDPD ymm1, ymm2/m256, imm8


Instruction Operand Encoding 


Op/En   Operand 1       Operand 2       Operand 3   Operand 4
RMI     ModRM:reg (w)   ModRM:r/m (r)   imm8        NA


Description 

Round the 2 double-precision floating-point values in the source operand (second operand) using the rounding 
mode specified in the immediate operand (third operand) and place the results in the destination operand (first 
operand). The rounding process rounds each input floating-point value to an integer value and returns the integer 
result as a double-precision floating-point value. 

The immediate operand specifies control fields for the rounding operation, three bit fields are defined and shown in 
Figure 4-24. Bit 3 of the immediate byte controls processor behavior for a precision exception, bit 2 selects the 
source of rounding mode control. Bits 1:0 specify a non-sticky rounding-mode value (Table 4-18 lists the encoded 
values for rounding-mode field). 
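
The three control fields can be extracted from the immediate byte with plain shifts and masks. The following is a minimal sketch; the `round_ctl` structure and the function `decode_round_imm8` are hypothetical names introduced here for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of the ROUNDxx immediate byte. */
typedef struct {
    bool     suppress_precision; /* bit 3 (P): 1 = precision exception not signaled */
    bool     use_mxcsr_rc;       /* bit 2 (RS): 1 = MXCSR.RC, 0 = imm8 RC field */
    unsigned rc;                 /* bits 1:0 (RC): rounding-mode encoding */
} round_ctl;

static round_ctl decode_round_imm8(uint8_t imm8)
{
    round_ctl c;
    c.suppress_precision = ((imm8 >> 3) & 1u) != 0;
    c.use_mxcsr_rc       = ((imm8 >> 2) & 1u) != 0;
    c.rc                 = imm8 & 3u;
    return c;
}
```

With this decoding, an immediate of `0x09` selects round down (RC = 01B) from the immediate and suppresses the precision exception.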

The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an 
SNaN then it will be converted to a QNaN. If DAZ is set to '1 then denormals will be converted to zero before 
rounding. 

128-bit Legacy SSE version: The second source can be an XMM register or 128-bit memory location. The destination
is not distinct from the first source XMM register and the upper bits (VLMAX-1:128) of the corresponding YMM
register destination are unmodified.

VEX.128 encoded version: The source operand is an XMM register or a 128-bit memory location. The destination
operand is an XMM register. The upper bits (VLMAX-1:128) of the corresponding YMM register destination are
zeroed.

VEX.256 encoded version: The source operand is a YMM register or a 256-bit memory location. The destination
operand is a YMM register.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD. 
[Figure: layout of the immediate byte. Bits 7:4: Reserved. Bit 3: P, Precision Mask; 0: normal, 1: inexact. Bit 2: RS, Rounding select; 1: MXCSR.RC, 0: Imm8.RC. Bits 1:0: RC, Rounding mode.]

Figure 4-24. Bit Control Fields of Immediate Byte for ROUNDxx Instruction


Table 4-18. Rounding Modes and Encoding of Rounding Control (RC) Field 


Rounding Mode                  RC Field Setting   Description
Round to nearest (even)        00B                Rounded result is the closest to the infinitely precise result. If two values are equally close, the result is the even value (i.e., the integer value with the least-significant bit of zero).
Round down (toward −∞)         01B                Rounded result is closest to but no greater than the infinitely precise result.
Round up (toward +∞)           10B                Rounded result is closest to but no less than the infinitely precise result.
Round toward zero (Truncate)   11B                Rounded result is closest to but no greater in absolute value than the infinitely precise result.
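
The four RC encodings in Table 4-18 can be illustrated with a small C model. This is a sketch under stated assumptions, not the hardware algorithm: it handles only finite inputs with magnitude below 2^52, it does not preserve the sign of a zero result, and the function name `round_with_rc` is hypothetical.

```c
/* Rounds x to an integral value according to a 2-bit RC encoding.
   Assumes x is finite and |x| < 2^52; NaNs, infinities and signed
   zeros are outside the scope of this sketch. */
static double round_with_rc(double x, unsigned rc)
{
    double t  = (double)(long long)x;   /* truncate toward zero */
    double lo = (t > x) ? t - 1.0 : t;  /* largest integer <= x */
    double hi = (t < x) ? t + 1.0 : t;  /* smallest integer >= x */

    switch (rc & 3u) {
    case 0: {                           /* 00B: nearest, ties to even */
        double d = x - lo;
        if (d < 0.5) return lo;
        if (d > 0.5) return hi;
        return ((long long)lo % 2 == 0) ? lo : hi;
    }
    case 1:  return lo;                 /* 01B: toward minus infinity */
    case 2:  return hi;                 /* 10B: toward plus infinity */
    default: return t;                  /* 11B: toward zero */
    }
}
```

The tie case is where the modes differ most visibly: 2.5 rounds to 2.0 under 00B, while 3.5 rounds to 4.0, because the even neighbor is chosen.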


Operation 

IF (imm[2] = '1)
    THEN // rounding mode is determined by MXCSR.RC
        DEST[63:0] ← ConvertDPFPToInteger_M(SRC[63:0]);
        DEST[127:64] ← ConvertDPFPToInteger_M(SRC[127:64]);
    ELSE // rounding mode is determined by IMM8.RC
        DEST[63:0] ← ConvertDPFPToInteger_Imm(SRC[63:0]);
        DEST[127:64] ← ConvertDPFPToInteger_Imm(SRC[127:64]);
FI

ROUNDPD (128-bit Legacy SSE version)

DEST[63:0] ← RoundToInteger(SRC[63:0], ROUND_CONTROL)
DEST[127:64] ← RoundToInteger(SRC[127:64], ROUND_CONTROL)
DEST[VLMAX-1:128] (Unmodified)

VROUNDPD (VEX.128 encoded version)

DEST[63:0] ← RoundToInteger(SRC[63:0], ROUND_CONTROL)
DEST[127:64] ← RoundToInteger(SRC[127:64], ROUND_CONTROL)
DEST[VLMAX-1:128] ← 0

VROUNDPD (VEX.256 encoded version)

DEST[63:0] ← RoundToInteger(SRC[63:0], ROUND_CONTROL)
DEST[127:64] ← RoundToInteger(SRC[127:64], ROUND_CONTROL)
DEST[191:128] ← RoundToInteger(SRC[191:128], ROUND_CONTROL)
DEST[255:192] ← RoundToInteger(SRC[255:192], ROUND_CONTROL)
Intel C/C++ Compiler Intrinsic Equivalent

__m128d _mm_round_pd(__m128d s1, int iRoundMode);

__m128d _mm_floor_pd(__m128d s1);

__m128d _mm_ceil_pd(__m128d s1);

__m256d _mm256_round_pd(__m256d s1, int iRoundMode);

__m256d _mm256_floor_pd(__m256d s1);

__m256d _mm256_ceil_pd(__m256d s1);

SIMD Floating-Point Exceptions 

Invalid (signaled only if SRC = SNaN) 

Precision (signaled only if imm[3] = '0; if imm[3] = '1, then the Precision Mask in the MXCSR is ignored and
precision exception is not signaled.)

Note that Denormal is not signaled by ROUNDPD.

Other Exceptions

See Exceptions Type 2; additionally
#UD If VEX.vvvv ≠ 1111B.
ROUNDPS — Round Packed Single Precision Floating-Point Values


Opcode*/Instruction               Op/En   64/32 bit Mode Support   CPUID Feature Flag   Description
66 0F 3A 08 /r ib                 RMI     V/V                      SSE4_1               Round packed single precision floating-point values in xmm2/m128 and place the result in xmm1. The rounding mode is determined by imm8.
ROUNDPS xmm1, xmm2/m128, imm8
VEX.128.66.0F3A.WIG 08 /r ib      RMI     V/V                      AVX                  Round packed single-precision floating-point values in xmm2/m128 and place the result in xmm1. The rounding mode is determined by imm8.
VROUNDPS xmm1, xmm2/m128, imm8
VEX.256.66.0F3A.WIG 08 /r ib      RMI     V/V                      AVX                  Round packed single-precision floating-point values in ymm2/m256 and place the result in ymm1. The rounding mode is determined by imm8.
VROUNDPS ymm1, ymm2/m256, imm8


Instruction Operand Encoding 


Op/En   Operand 1       Operand 2       Operand 3   Operand 4
RMI     ModRM:reg (w)   ModRM:r/m (r)   imm8        NA


Description 

Round the 4 single-precision floating-point values in the source operand (second operand) using the rounding 
mode specified in the immediate operand (third operand) and place the results in the destination operand (first 
operand). The rounding process rounds each input floating-point value to an integer value and returns the integer 
result as a single-precision floating-point value. 

The immediate operand specifies control fields for the rounding operation, three bit fields are defined and shown in 
Figure 4-24. Bit 3 of the immediate byte controls processor behavior for a precision exception, bit 2 selects the 
source of rounding mode control. Bits 1:0 specify a non-sticky rounding-mode value (Table 4-18 lists the encoded 
values for rounding-mode field). 

The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an 
SNaN then it will be converted to a QNaN. If DAZ is set to '1 then denormals will be converted to zero before 
rounding. 

128-bit Legacy SSE version: The second source can be an XMM register or 128-bit memory location. The destination
is not distinct from the first source XMM register and the upper bits (VLMAX-1:128) of the corresponding YMM
register destination are unmodified.

VEX.128 encoded version: The source operand is an XMM register or a 128-bit memory location. The destination
operand is an XMM register. The upper bits (VLMAX-1:128) of the corresponding YMM register destination are
zeroed.

VEX.256 encoded version: The source operand is a YMM register or a 256-bit memory location. The destination
operand is a YMM register.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b otherwise instructions will #UD. 
Operation

IF (imm[2] = '1)
    THEN // rounding mode is determined by MXCSR.RC
        DEST[31:0] ← ConvertSPFPToInteger_M(SRC[31:0]);
        DEST[63:32] ← ConvertSPFPToInteger_M(SRC[63:32]);
        DEST[95:64] ← ConvertSPFPToInteger_M(SRC[95:64]);
        DEST[127:96] ← ConvertSPFPToInteger_M(SRC[127:96]);
    ELSE // rounding mode is determined by IMM8.RC
        DEST[31:0] ← ConvertSPFPToInteger_Imm(SRC[31:0]);
        DEST[63:32] ← ConvertSPFPToInteger_Imm(SRC[63:32]);
        DEST[95:64] ← ConvertSPFPToInteger_Imm(SRC[95:64]);
        DEST[127:96] ← ConvertSPFPToInteger_Imm(SRC[127:96]);
FI;

ROUNDPS (128-bit Legacy SSE version)

DEST[31:0] ← RoundToInteger(SRC[31:0], ROUND_CONTROL)
DEST[63:32] ← RoundToInteger(SRC[63:32], ROUND_CONTROL)
DEST[95:64] ← RoundToInteger(SRC[95:64], ROUND_CONTROL)
DEST[127:96] ← RoundToInteger(SRC[127:96], ROUND_CONTROL)
DEST[VLMAX-1:128] (Unmodified)

VROUNDPS (VEX.128 encoded version)

DEST[31:0] ← RoundToInteger(SRC[31:0], ROUND_CONTROL)
DEST[63:32] ← RoundToInteger(SRC[63:32], ROUND_CONTROL)
DEST[95:64] ← RoundToInteger(SRC[95:64], ROUND_CONTROL)
DEST[127:96] ← RoundToInteger(SRC[127:96], ROUND_CONTROL)
DEST[VLMAX-1:128] ← 0

VROUNDPS (VEX.256 encoded version)

DEST[31:0] ← RoundToInteger(SRC[31:0], ROUND_CONTROL)
DEST[63:32] ← RoundToInteger(SRC[63:32], ROUND_CONTROL)
DEST[95:64] ← RoundToInteger(SRC[95:64], ROUND_CONTROL)
DEST[127:96] ← RoundToInteger(SRC[127:96], ROUND_CONTROL)
DEST[159:128] ← RoundToInteger(SRC[159:128], ROUND_CONTROL)
DEST[191:160] ← RoundToInteger(SRC[191:160], ROUND_CONTROL)
DEST[223:192] ← RoundToInteger(SRC[223:192], ROUND_CONTROL)
DEST[255:224] ← RoundToInteger(SRC[255:224], ROUND_CONTROL)

Intel C/C++ Compiler Intrinsic Equivalent 

__m128 _mm_round_ps(__m128 s1, int iRoundMode);

__m128 _mm_floor_ps(__m128 s1);

__m128 _mm_ceil_ps(__m128 s1);

__m256 _mm256_round_ps(__m256 s1, int iRoundMode);

__m256 _mm256_floor_ps(__m256 s1);

__m256 _mm256_ceil_ps(__m256 s1);
SIMD Floating-Point Exceptions

Invalid (signaled only if SRC = SNaN)

Precision (signaled only if imm[3] = '0; if imm[3] = '1, then the Precision Mask in the MXCSR is ignored and
precision exception is not signaled.)

Note that Denormal is not signaled by ROUNDPS.

Other Exceptions

See Exceptions Type 2; additionally
#UD If VEX.vvvv ≠ 1111B.
ROUNDSD — Round Scalar Double Precision Floating-Point Values


Opcode*/Instruction                    Op/En   64/32 bit Mode Support   CPUID Feature Flag   Description
66 0F 3A 0B /r ib                      RMI     V/V                      SSE4_1               Round the low packed double precision floating-point value in xmm2/m64 and place the result in xmm1. The rounding mode is determined by imm8.
ROUNDSD xmm1, xmm2/m64, imm8
VEX.NDS.LIG.66.0F3A.WIG 0B /r ib       RVMI    V/V                      AVX                  Round the low packed double precision floating-point value in xmm3/m64 and place the result in xmm1. The rounding mode is determined by imm8. Upper packed double precision floating-point value (bits[127:64]) from xmm2 is copied to xmm1[127:64].
VROUNDSD xmm1, xmm2, xmm3/m64, imm8


Instruction Operand Encoding


Op/En   Operand 1       Operand 2       Operand 3       Operand 4
RMI     ModRM:reg (w)   ModRM:r/m (r)   imm8            NA
RVMI    ModRM:reg (w)   VEX.vvvv (r)    ModRM:r/m (r)   imm8


Description 

Round the double-precision floating-point value in the lower qword of the source operand (second operand) using the
rounding mode specified in the immediate operand (third operand) and place the result in the destination operand
(first operand). The rounding process rounds a double-precision floating-point input to an integer value and returns
the integer result as a double-precision floating-point value in the lowest position. The upper double-precision
floating-point value in the destination is retained.

The immediate operand specifies control fields for the rounding operation, three bit fields are defined and shown in 
Figure 4-24. Bit 3 of the immediate byte controls processor behavior for a precision exception, bit 2 selects the 
source of rounding mode control. Bits 1:0 specify a non-sticky rounding-mode value (Table 4-18 lists the encoded 
values for rounding-mode field). 

The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an 
SNaN then it will be converted to a QNaN. If DAZ is set to '1 then denormals will be converted to zero before 
rounding. 

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits
(VLMAX-1:64) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (VLMAX-1:128) of the destination YMM register are zeroed.

Operation 

IF (imm[2] = '1)
    THEN // rounding mode is determined by MXCSR.RC
        DEST[63:0] ← ConvertDPFPToInteger_M(SRC[63:0]);
    ELSE // rounding mode is determined by IMM8.RC
        DEST[63:0] ← ConvertDPFPToInteger_Imm(SRC[63:0]);
FI;
DEST[127:64] remains unchanged;

ROUNDSD (128-bit Legacy SSE version)

DEST[63:0] ← RoundToInteger(SRC[63:0], ROUND_CONTROL)
DEST[VLMAX-1:64] (Unmodified)

VROUNDSD (VEX.128 encoded version)

DEST[63:0] ← RoundToInteger(SRC2[63:0], ROUND_CONTROL)
DEST[127:64] ← SRC1[127:64]
DEST[VLMAX-1:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent 

ROUNDSD: __m128d _mm_round_sd(__m128d dst, __m128d s1, int iRoundMode);

__m128d _mm_floor_sd(__m128d dst, __m128d s1);

__m128d _mm_ceil_sd(__m128d dst, __m128d s1);

SIMD Floating-Point Exceptions 

Invalid (signaled only if SRC = SNaN) 

Precision (signaled only if imm[3] = '0; if imm[3] = '1, then the Precision Mask in the MXCSR is ignored and
precision exception is not signaled.)

Note that Denormal is not signaled by ROUNDSD. 

Other Exceptions 

See Exceptions Type 3. 
ROUNDSS — Round Scalar Single Precision Floating-Point Values


Opcode*/Instruction                    Op/En   64/32 bit Mode Support   CPUID Feature Flag   Description
66 0F 3A 0A /r ib                      RMI     V/V                      SSE4_1               Round the low packed single precision floating-point value in xmm2/m32 and place the result in xmm1. The rounding mode is determined by imm8.
ROUNDSS xmm1, xmm2/m32, imm8
VEX.NDS.LIG.66.0F3A.WIG 0A /r ib       RVMI    V/V                      AVX                  Round the low packed single precision floating-point value in xmm3/m32 and place the result in xmm1. The rounding mode is determined by imm8. Also, upper packed single precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32].
VROUNDSS xmm1, xmm2, xmm3/m32, imm8


Instruction Operand Encoding 


Op/En   Operand 1       Operand 2       Operand 3       Operand 4
RMI     ModRM:reg (w)   ModRM:r/m (r)   imm8            NA
RVMI    ModRM:reg (w)   VEX.vvvv (r)    ModRM:r/m (r)   imm8


Description 

Round the single-precision floating-point value in the lowest dword of the source operand (second operand) using 
the rounding mode specified in the immediate operand (third operand) and place the result in the destination 
operand (first operand). The rounding process rounds a single-precision floating-point input to an integer value and 
returns the result as a single-precision floating-point value in the lowest position. The upper three single-precision 
floating-point values in the destination are retained. 

The immediate operand specifies control fields for the rounding operation, three bit fields are defined and shown in 
Figure 4-24. Bit 3 of the immediate byte controls processor behavior for a precision exception, bit 2 selects the 
source of rounding mode control. Bits 1:0 specify a non-sticky rounding-mode value (Table 4-18 lists the encoded 
values for rounding-mode field). 

The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an 
SNaN then it will be converted to a QNaN. If DAZ is set to '1 then denormals will be converted to zero before 
rounding. 

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits
(VLMAX-1:32) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (VLMAX-1:128) of the destination YMM register are zeroed.

Operation 

IF (imm[2] = '1)
    THEN // rounding mode is determined by MXCSR.RC
        DEST[31:0] ← ConvertSPFPToInteger_M(SRC[31:0]);
    ELSE // rounding mode is determined by IMM8.RC
        DEST[31:0] ← ConvertSPFPToInteger_Imm(SRC[31:0]);
FI;
DEST[127:32] remains unchanged;

ROUNDSS (128-bit Legacy SSE version)

DEST[31:0] ← RoundToInteger(SRC[31:0], ROUND_CONTROL)
DEST[VLMAX-1:32] (Unmodified)

VROUNDSS (VEX.128 encoded version)

DEST[31:0] ← RoundToInteger(SRC2[31:0], ROUND_CONTROL)
DEST[127:32] ← SRC1[127:32]
DEST[VLMAX-1:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

ROUNDSS: __m128 _mm_round_ss(__m128 dst, __m128 s1, int iRoundMode);

__m128 _mm_floor_ss(__m128 dst, __m128 s1);

__m128 _mm_ceil_ss(__m128 dst, __m128 s1);

SIMD Floating-Point Exceptions 

Invalid (signaled only if SRC = SNaN) 

Precision (signaled only if imm[3] = '0; if imm[3] = '1, then the Precision Mask in the MXCSR is ignored and
precision exception is not signaled.)

Note that Denormal is not signaled by ROUNDSS. 

Other Exceptions 

See Exceptions Type 3. 
RSM—Resume from System Management Mode


Opcode*   Instruction   Op/En   64-Bit Mode   Compat/Leg Mode   Description
0F AA     RSM           NP      Valid         Valid             Resume operation of interrupted program.


Instruction Operand Encoding 


Op/En   Operand 1   Operand 2   Operand 3   Operand 4
NP      NA          NA          NA          NA


Description 

Returns program control from system management mode (SMM) to the application program or operating-system
procedure that was interrupted when the processor received an SMM interrupt. The processor's state is restored
from the dump created upon entering SMM. If the processor detects invalid state information during state
restoration, it enters the shutdown state. The following invalid information can cause a shutdown:

• Any reserved bit of CR4 is set to 1.

• Any illegal combination of bits in CR0, such as (PG=1 and PE=0) or (NW=1 and CD=0).

• (Intel Pentium and Intel486™ processors only.) The value stored in the state dump base field is not a 32-KByte
aligned address.
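
The two CR0 combinations named in the list above can be expressed as a simple predicate. This is a minimal sketch that covers only those two examples, not every condition a processor may check; the macro and function names are illustrative, and the bit positions (PE = bit 0, NW = bit 29, CD = bit 30, PG = bit 31) follow the architectural CR0 layout.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_PE (1u << 0)   /* Protection Enable */
#define CR0_NW (1u << 29)  /* Not Write-through */
#define CR0_CD (1u << 30)  /* Cache Disable */
#define CR0_PG (1u << 31)  /* Paging */

/* Returns true for the two illegal combinations cited in the text:
   paging without protected mode, or NW set while CD is clear. */
static bool cr0_combination_is_illegal(uint32_t cr0)
{
    bool pg_without_pe = (cr0 & CR0_PG) && !(cr0 & CR0_PE);
    bool nw_without_cd = (cr0 & CR0_NW) && !(cr0 & CR0_CD);
    return pg_without_pe || nw_without_cd;
}
```

A saved CR0 image of `0x80000000` (PG set, PE clear) would be flagged, while `0x80000001` would not.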

The contents of the model-specific registers are not affected by a return from SMM. 

The SMM state map used by RSM supports resuming processor context for non-64-bit modes and 64-bit mode. 

See Chapter 34, "System Management Mode," in the Intel® 64 and IA-32 Architectures Software Developer's 
Manual, Volume 3C, for more information about SMM and the behavior of the RSM instruction. 

Operation 

ReturnFromSMM;
IF (IA-32e mode supported) or (CPUID DisplayFamily_DisplayModel = 06H_0CH)
    THEN
        ProcessorState ← Restore(SMMDump(IA-32e SMM STATE MAP));
    ELSE
        ProcessorState ← Restore(SMMDump(Non-32-Bit-Mode SMM STATE MAP));
FI

Flags Affected 

All. 

Protected Mode Exceptions 

#UD If an attempt is made to execute this instruction when the processor is not in SMM. 

If the LOCK prefix is used. 

Real-Address Mode Exceptions 

Same exceptions as in protected mode. 

Virtual-8086 Mode Exceptions

Same exceptions as in protected mode. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 
64-Bit Mode Exceptions

Same exceptions as in protected mode. 


RSQRTPS—Compute Reciprocals of Square Roots of Packed Single-Precision Floating-Point
Values


Opcode*/Instruction         Op/En   64/32 bit Mode Support   CPUID Feature Flag   Description
0F 52 /r                    RM      V/V                      SSE                  Computes the approximate reciprocals of the square roots of the packed single-precision floating-point values in xmm2/m128 and stores the results in xmm1.
RSQRTPS xmm1, xmm2/m128
VEX.128.0F.WIG 52 /r        RM      V/V                      AVX                  Computes the approximate reciprocals of the square roots of packed single-precision values in xmm2/mem and stores the results in xmm1.
VRSQRTPS xmm1, xmm2/m128
VEX.256.0F.WIG 52 /r        RM      V/V                      AVX                  Computes the approximate reciprocals of the square roots of packed single-precision values in ymm2/mem and stores the results in ymm1.
VRSQRTPS ymm1, ymm2/m256


Instruction Operand Encoding


Op/En   Operand 1       Operand 2       Operand 3   Operand 4
RM      ModRM:reg (w)   ModRM:r/m (r)   NA          NA


Description 

Performs a SIMD computation of the approximate reciprocals of the square roots of the four packed single-precision
floating-point values in the source operand (second operand) and stores the packed single-precision floating-point
results in the destination operand. The source operand can be an XMM register or a 128-bit memory location.
The destination operand is an XMM register. See Figure 10-5 in the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 1, for an illustration of a SIMD single-precision floating-point operation.

The relative error for this approximation is:

|Relative Error| ≤ 1.5 * 2^-12

The RSQRTPS instruction is not affected by the rounding control bits in the MXCSR register. When a source value is
a 0.0, an ∞ of the sign of the source value is returned. A denormal source value is treated as a 0.0 (of the same
sign). When a source value is a negative value (other than -0.0), a floating-point indefinite is returned. When a
source value is an SNaN or QNaN, the SNaN is converted to a QNaN or the source QNaN is returned.
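
The special-case rules in the paragraph above can be summarized by classifying the input bit pattern. The sketch below is illustrative only: the enum, its member names, and `classify_rsqrt_input` are hypothetical, and it assumes `float` is an IEEE 754 binary32 value. It does not compute the approximation itself.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical result categories for one source element. */
typedef enum {
    RSQRT_APPROX,      /* ordinary input: approximate 1/sqrt(x) is returned */
    RSQRT_POS_INF,     /* +0.0 or positive denormal: +infinity */
    RSQRT_NEG_INF,     /* -0.0 or negative denormal: -infinity */
    RSQRT_INDEFINITE,  /* negative input other than -0.0: indefinite */
    RSQRT_QNAN         /* SNaN is quieted, QNaN is passed through */
} rsqrt_class;

static rsqrt_class classify_rsqrt_input(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);          /* well-defined type pun */
    uint32_t sign = bits >> 31;
    uint32_t exp  = (bits >> 23) & 0xFFu;
    uint32_t frac = bits & 0x7FFFFFu;

    if (exp == 0xFFu && frac != 0)
        return RSQRT_QNAN;
    if (exp == 0)                            /* zero or denormal, treated as zero */
        return sign ? RSQRT_NEG_INF : RSQRT_POS_INF;
    if (sign)
        return RSQRT_INDEFINITE;
    return RSQRT_APPROX;
}
```

Denormals land in the same branch as zeros because the text states that a denormal source is treated as a 0.0 of the same sign.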

In 64-bit mode, using a REX prefix in the form of REX.R permits this instruction to access additional registers 
(XMM8-XMM15). 

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The
destination is not distinct from the first source XMM register and the upper bits (VLMAX-1:128) of the corresponding
YMM register destination are unmodified.

VEX.128 encoded version: The first source operand is an XMM register or 128-bit memory location. The destination
operand is an XMM register. The upper bits (VLMAX-1:128) of the corresponding YMM register destination are
zeroed.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD. 
Operation

RSQRTPS (128-bit Legacy SSE version)

DEST[31:0] ← APPROXIMATE(1/SQRT(SRC[31:0]))
DEST[63:32] ← APPROXIMATE(1/SQRT(SRC[63:32]))
DEST[95:64] ← APPROXIMATE(1/SQRT(SRC[95:64]))
DEST[127:96] ← APPROXIMATE(1/SQRT(SRC[127:96]))
DEST[VLMAX-1:128] (Unmodified)

VRSQRTPS (VEX.128 encoded version)

DEST[31:0] ← APPROXIMATE(1/SQRT(SRC[31:0]))
DEST[63:32] ← APPROXIMATE(1/SQRT(SRC[63:32]))
DEST[95:64] ← APPROXIMATE(1/SQRT(SRC[95:64]))
DEST[127:96] ← APPROXIMATE(1/SQRT(SRC[127:96]))
DEST[VLMAX-1:128] ← 0

VRSQRTPS (VEX.256 encoded version)

DEST[31:0] ← APPROXIMATE(1/SQRT(SRC[31:0]))
DEST[63:32] ← APPROXIMATE(1/SQRT(SRC[63:32]))
DEST[95:64] ← APPROXIMATE(1/SQRT(SRC[95:64]))
DEST[127:96] ← APPROXIMATE(1/SQRT(SRC[127:96]))
DEST[159:128] ← APPROXIMATE(1/SQRT(SRC[159:128]))
DEST[191:160] ← APPROXIMATE(1/SQRT(SRC[191:160]))
DEST[223:192] ← APPROXIMATE(1/SQRT(SRC[223:192]))
DEST[255:224] ← APPROXIMATE(1/SQRT(SRC[255:224]))

Intel C/C++ Compiler Intrinsic Equivalent 

RSQRTPS: __m128 _mm_rsqrt_ps(__m128 a);

RSQRTPS: __m256 _mm256_rsqrt_ps(__m256 a);

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

See Exceptions Type 4; additionally 
#UD If VEX.vvvv ≠ 1111B.
RSQRTSS—Compute Reciprocal of Square Root of Scalar Single-Precision Floating-Point Value


Opcode*/Instruction               Op/En   64/32 bit Mode Support   CPUID Feature Flag   Description
F3 0F 52 /r                       RM      V/V                      SSE                  Computes the approximate reciprocal of the square root of the low single-precision floating-point value in xmm2/m32 and stores the results in xmm1.
RSQRTSS xmm1, xmm2/m32
VEX.NDS.LIG.F3.0F.WIG 52 /r       RVM     V/V                      AVX                  Computes the approximate reciprocal of the square root of the low single precision floating-point value in xmm3/m32 and stores the results in xmm1. Also, upper single precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32].
VRSQRTSS xmm1, xmm2, xmm3/m32


Instruction Operand Encoding 


Op/En   Operand 1       Operand 2       Operand 3       Operand 4
RM      ModRM:reg (w)   ModRM:r/m (r)   NA              NA
RVM     ModRM:reg (w)   VEX.vvvv (r)    ModRM:r/m (r)   NA


Description 

Computes an approximate reciprocal of the square root of the low single-precision floating-point value in the 
source operand (second operand) stores the single-precision floating-point result in the destination operand. The 
source operand can be an XMM register or a 32-bit memory location. The destination operand is an XMM register. 
The three high-order doublewords of the destination operand remain unchanged. See Figure 10-6 in the Intel® 64 
and IA-32 Architectures Software Developer's Manual, Volume 1, for an illustration of a scalar single-precision 
floating-point operation. 

The relative error for this approximation is:

|Relative Error| ≤ 1.5 ∗ 2^−12

The RSQRTSS instruction is not affected by the rounding control bits in the MXCSR register. When a source value is
a 0.0, an ∞ of the sign of the source value is returned. A denormal source value is treated as a 0.0 (of the same
sign). When a source value is a negative value (other than -0.0), a floating-point indefinite is returned. When a
source value is an SNaN or QNaN, the SNaN is converted to a QNaN or the source QNaN is returned.
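
The special cases and the error bound above can be sketched as a small Python model. This is only an illustration under stated assumptions: the names rsqrtss_reference and within_documented_error are hypothetical, a Python NaN stands in for the floating-point indefinite, and the function returns the ideal (correctly rounded) value that the hardware merely approximates.

```python
import math
import struct

FLT_MIN_NORMAL = 2.0 ** -126  # smallest positive normal single-precision value


def to_f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def rsqrtss_reference(x):
    """Ideal result that RSQRTSS approximates (hypothetical helper)."""
    if math.isnan(x):
        return math.nan                    # SNaN or QNaN source yields a QNaN
    if abs(x) < FLT_MIN_NORMAL:            # 0.0 or a denormal, treated as signed 0.0
        return math.copysign(math.inf, x)  # infinity with the sign of the source
    if x < 0.0:
        return math.nan                    # stands in for the floating-point indefinite
    return to_f32(1.0 / math.sqrt(x))


def within_documented_error(approx, x):
    """Check |relative error| <= 1.5 * 2**-12 for a positive normal source x."""
    exact = 1.0 / math.sqrt(x)
    return abs((approx - exact) / exact) <= 1.5 * 2.0 ** -12
```

A test harness could compare values read back from real hardware against rsqrtss_reference using within_documented_error rather than exact equality, since the instruction only guarantees the relative error bound.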

In 64-bit mode, using a REX prefix in the form of REX.R permits this instruction to access additional registers 
(XMM8-XMM15). 

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (VLMAX-1:32)
of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (VLMAX-1:128) of the destination YMM register are zeroed.

Operation 

RSQRTSS (128-bit Legacy SSE version)

DEST[31:0] ← APPROXIMATE(1/SQRT(SRC2[31:0]))
DEST[VLMAX-1:32] (Unmodified)

VRSQRTSS (VEX.128 encoded version)

DEST[31:0] ← APPROXIMATE(1/SQRT(SRC2[31:0]))
DEST[127:32] ← SRC1[127:32]
DEST[VLMAX-1:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

RSQRTSS: __m128 _mm_rsqrt_ss(__m128 a)

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

See Exceptions Type 5. 


SAHF—Store AH into Flags


Opcode* | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
9E | SAHF | NP | Invalid* | Valid | Loads SF, ZF, AF, PF, and CF from AH into EFLAGS register.


NOTES: 


* Valid in specific steppings. See Description section. 


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
NP | NA | NA | NA | NA


Description 

Loads the SF, ZF, AF, PF, and CF flags of the EFLAGS register with values from the corresponding bits in the AH 
register (bits 7, 6, 4, 2, and 0, respectively). Bits 1, 3, and 5 of register AH are ignored; the corresponding reserved 
bits (1, 3, and 5) in the EFLAGS register remain as shown in the "Operation" section below. 

This instruction executes as described above in compatibility mode and legacy mode. It is valid in 64-bit mode only 
if CPUID.80000001H:ECX.LAHF-SAHF[bit0] = 1. 
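
The bit mapping described above can be sketched in Python as follows. This is a minimal model, not processor code: the names sahf, SAHF_MASK and sahf_valid_in_64bit_mode are hypothetical, and the flags register is represented as a plain integer.

```python
SAHF_MASK = 0xD5  # bits 7, 6, 4, 2, 0 of AH: SF, ZF, AF, PF, CF


def sahf(flags, ah):
    """Return the flags register value after SAHF, given the AH byte."""
    # Bits 1, 3 and 5 of AH are ignored; the reserved bits in the low flags
    # byte keep their fixed values (bit 1 = 1, bits 3 and 5 = 0).
    low = (ah & SAHF_MASK) | 0x02
    return (flags & ~0xFF) | low


def sahf_valid_in_64bit_mode(cpuid_80000001_ecx):
    """SAHF is valid in 64-bit mode only if CPUID.80000001H:ECX bit 0 is 1."""
    return (cpuid_80000001_ecx & 1) == 1
```

Only the low byte of the flags register changes; higher bits such as DF and OF pass through untouched in this model, which matches the list of flags the instruction loads.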

Operation 

IF 64-Bit Mode
    THEN
        IF CPUID.80000001H:ECX[0] = 1;
            THEN
                RFLAGS(SF:ZF:0:AF:0:PF:1:CF) ← AH;
            ELSE
                #UD;
        FI
    ELSE
        EFLAGS(SF:ZF:0:AF:0:PF:1:CF) ← AH;
FI;

Flags Affected 

The SF, ZF, AF, PF, and CF flags are loaded with values from the AH register. Bits 1, 3, and 5 of the EFLAGS register 
are unaffected, with the values remaining 1, 0, and 0, respectively. 

Protected Mode Exceptions 

None. 

Real-Address Mode Exceptions 

None. 

Virtual-8086 Mode Exceptions

None. 

Compatibility Mode Exceptions 

None.

64-Bit Mode Exceptions

#UD If CPUID.80000001H.ECX[0] = 0.

If the LOCK prefix is used.


SAL/SAR/SHL/SHR-Shift


Opcode*** | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
D0 /4 | SAL r/m8, 1 | M1 | Valid | Valid | Multiply r/m8 by 2, once.
REX + D0 /4 | SAL r/m8**, 1 | M1 | Valid | N.E. | Multiply r/m8 by 2, once.
D2 /4 | SAL r/m8, CL | MC | Valid | Valid | Multiply r/m8 by 2, CL times.
REX + D2 /4 | SAL r/m8**, CL | MC | Valid | N.E. | Multiply r/m8 by 2, CL times.
C0 /4 ib | SAL r/m8, imm8 | MI | Valid | Valid | Multiply r/m8 by 2, imm8 times.
REX + C0 /4 ib | SAL r/m8**, imm8 | MI | Valid | N.E. | Multiply r/m8 by 2, imm8 times.
D1 /4 | SAL r/m16, 1 | M1 | Valid | Valid | Multiply r/m16 by 2, once.
D3 /4 | SAL r/m16, CL | MC | Valid | Valid | Multiply r/m16 by 2, CL times.
C1 /4 ib | SAL r/m16, imm8 | MI | Valid | Valid | Multiply r/m16 by 2, imm8 times.
D1 /4 | SAL r/m32, 1 | M1 | Valid | Valid | Multiply r/m32 by 2, once.
REX.W + D1 /4 | SAL r/m64, 1 | M1 | Valid | N.E. | Multiply r/m64 by 2, once.
D3 /4 | SAL r/m32, CL | MC | Valid | Valid | Multiply r/m32 by 2, CL times.
REX.W + D3 /4 | SAL r/m64, CL | MC | Valid | N.E. | Multiply r/m64 by 2, CL times.
C1 /4 ib | SAL r/m32, imm8 | MI | Valid | Valid | Multiply r/m32 by 2, imm8 times.
REX.W + C1 /4 ib | SAL r/m64, imm8 | MI | Valid | N.E. | Multiply r/m64 by 2, imm8 times.
D0 /7 | SAR r/m8, 1 | M1 | Valid | Valid | Signed divide* r/m8 by 2, once.
REX + D0 /7 | SAR r/m8**, 1 | M1 | Valid | N.E. | Signed divide* r/m8 by 2, once.
D2 /7 | SAR r/m8, CL | MC | Valid | Valid | Signed divide* r/m8 by 2, CL times.
REX + D2 /7 | SAR r/m8**, CL | MC | Valid | N.E. | Signed divide* r/m8 by 2, CL times.
C0 /7 ib | SAR r/m8, imm8 | MI | Valid | Valid | Signed divide* r/m8 by 2, imm8 times.
REX + C0 /7 ib | SAR r/m8**, imm8 | MI | Valid | N.E. | Signed divide* r/m8 by 2, imm8 times.
D1 /7 | SAR r/m16, 1 | M1 | Valid | Valid | Signed divide* r/m16 by 2, once.
D3 /7 | SAR r/m16, CL | MC | Valid | Valid | Signed divide* r/m16 by 2, CL times.
C1 /7 ib | SAR r/m16, imm8 | MI | Valid | Valid | Signed divide* r/m16 by 2, imm8 times.
D1 /7 | SAR r/m32, 1 | M1 | Valid | Valid | Signed divide* r/m32 by 2, once.
REX.W + D1 /7 | SAR r/m64, 1 | M1 | Valid | N.E. | Signed divide* r/m64 by 2, once.
D3 /7 | SAR r/m32, CL | MC | Valid | Valid | Signed divide* r/m32 by 2, CL times.
REX.W + D3 /7 | SAR r/m64, CL | MC | Valid | N.E. | Signed divide* r/m64 by 2, CL times.
C1 /7 ib | SAR r/m32, imm8 | MI | Valid | Valid | Signed divide* r/m32 by 2, imm8 times.
REX.W + C1 /7 ib | SAR r/m64, imm8 | MI | Valid | N.E. | Signed divide* r/m64 by 2, imm8 times.
D0 /4 | SHL r/m8, 1 | M1 | Valid | Valid | Multiply r/m8 by 2, once.
REX + D0 /4 | SHL r/m8**, 1 | M1 | Valid | N.E. | Multiply r/m8 by 2, once.
D2 /4 | SHL r/m8, CL | MC | Valid | Valid | Multiply r/m8 by 2, CL times.
REX + D2 /4 | SHL r/m8**, CL | MC | Valid | N.E. | Multiply r/m8 by 2, CL times.
C0 /4 ib | SHL r/m8, imm8 | MI | Valid | Valid | Multiply r/m8 by 2, imm8 times.
REX + C0 /4 ib | SHL r/m8**, imm8 | MI | Valid | N.E. | Multiply r/m8 by 2, imm8 times.
D1 /4 | SHL r/m16, 1 | M1 | Valid | Valid | Multiply r/m16 by 2, once.
D3 /4 | SHL r/m16, CL | MC | Valid | Valid | Multiply r/m16 by 2, CL times.
C1 /4 ib | SHL r/m16, imm8 | MI | Valid | Valid | Multiply r/m16 by 2, imm8 times.
D1 /4 | SHL r/m32, 1 | M1 | Valid | Valid | Multiply r/m32 by 2, once.
REX.W + D1 /4 | SHL r/m64, 1 | M1 | Valid | N.E. | Multiply r/m64 by 2, once.
D3 /4 | SHL r/m32, CL | MC | Valid | Valid | Multiply r/m32 by 2, CL times.
REX.W + D3 /4 | SHL r/m64, CL | MC | Valid | N.E. | Multiply r/m64 by 2, CL times.
C1 /4 ib | SHL r/m32, imm8 | MI | Valid | Valid | Multiply r/m32 by 2, imm8 times.
REX.W + C1 /4 ib | SHL r/m64, imm8 | MI | Valid | N.E. | Multiply r/m64 by 2, imm8 times.
D0 /5 | SHR r/m8, 1 | M1 | Valid | Valid | Unsigned divide r/m8 by 2, once.
REX + D0 /5 | SHR r/m8**, 1 | M1 | Valid | N.E. | Unsigned divide r/m8 by 2, once.
D2 /5 | SHR r/m8, CL | MC | Valid | Valid | Unsigned divide r/m8 by 2, CL times.
REX + D2 /5 | SHR r/m8**, CL | MC | Valid | N.E. | Unsigned divide r/m8 by 2, CL times.
C0 /5 ib | SHR r/m8, imm8 | MI | Valid | Valid | Unsigned divide r/m8 by 2, imm8 times.
REX + C0 /5 ib | SHR r/m8**, imm8 | MI | Valid | N.E. | Unsigned divide r/m8 by 2, imm8 times.
D1 /5 | SHR r/m16, 1 | M1 | Valid | Valid | Unsigned divide r/m16 by 2, once.
D3 /5 | SHR r/m16, CL | MC | Valid | Valid | Unsigned divide r/m16 by 2, CL times.
C1 /5 ib | SHR r/m16, imm8 | MI | Valid | Valid | Unsigned divide r/m16 by 2, imm8 times.
D1 /5 | SHR r/m32, 1 | M1 | Valid | Valid | Unsigned divide r/m32 by 2, once.
REX.W + D1 /5 | SHR r/m64, 1 | M1 | Valid | N.E. | Unsigned divide r/m64 by 2, once.
D3 /5 | SHR r/m32, CL | MC | Valid | Valid | Unsigned divide r/m32 by 2, CL times.
REX.W + D3 /5 | SHR r/m64, CL | MC | Valid | N.E. | Unsigned divide r/m64 by 2, CL times.
C1 /5 ib | SHR r/m32, imm8 | MI | Valid | Valid | Unsigned divide r/m32 by 2, imm8 times.
REX.W + C1 /5 ib | SHR r/m64, imm8 | MI | Valid | N.E. | Unsigned divide r/m64 by 2, imm8 times.


NOTES: 


* Not the same form of division as IDIV; rounding is toward negative infinity. 

** In 64-bit mode, r/m8 can not be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH. 
***See IA-32 Architecture Compatibility section below. 


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
M1 | ModRM:r/m (r, w) | 1 | NA | NA
MC | ModRM:r/m (r, w) | CL | NA | NA
MI | ModRM:r/m (r, w) | imm8 | NA | NA


Description 

Shifts the bits in the first operand (destination operand) to the left or right by the number of bits specified in the 
second operand (count operand). Bits shifted beyond the destination operand boundary are first shifted into the CF 
flag, then discarded. At the end of the shift operation, the CF flag contains the last bit shifted out of the destination 
operand. 

The destination operand can be a register or a memory location. The count operand can be an immediate value or 
the CL register. The count is masked to 5 bits (or 6 bits if in 64-bit mode and REX.W is used). The count range is 
limited to 0 to 31 (or 63 if 64-bit mode and REX.W is used). A special opcode encoding is provided for a count of 1. 

The shift arithmetic left (SAL) and shift logical left (SHL) instructions perform the same operation; they shift the 
bits in the destination operand to the left (toward more significant bit locations). For each shift count, the most 
significant bit of the destination operand is shifted into the CF flag, and the least significant bit is cleared (see 
Figure 7-7 in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1).

The shift arithmetic right (SAR) and shift logical right (SHR) instructions shift the bits of the destination operand to
the right (toward less significant bit locations). For each shift count, the least significant bit of the destination 
operand is shifted into the CF flag, and the most significant bit is either set or cleared depending on the instruction 
type. The SHR instruction clears the most significant bit (see Figure 7-8 in the Intel® 64 and IA-32 Architectures 
Software Developer's Manual, Volume 1); the SAR instruction sets or clears the most significant bit to correspond 
to the sign (most significant bit) of the original value in the destination operand. In effect, the SAR instruction fills 
the empty bit position's shifted value with the sign of the unshifted value (see Figure 7-9 in the Intel® 64 and IA-32 
Architectures Software Developer's Manual, Volume 1). 

The SAR and SHR instructions can be used to perform signed or unsigned division, respectively, of the destination 
operand by powers of 2. For example, using the SAR instruction to shift a signed integer 1 bit to the right divides 
the value by 2. 

Using the SAR instruction to perform a division operation does not produce the same result as the IDIV instruction. 
The quotient from the IDIV instruction is rounded toward zero, whereas the "quotient" of the SAR instruction is 
rounded toward negative infinity. This difference is apparent only for negative numbers. For example, when the 
IDIV instruction is used to divide -9 by 4, the result is -2 with a remainder of -1. If the SAR instruction is used to 
shift -9 right by two bits, the result is -3 and the "remainder" is +3; however, the SAR instruction stores only the 
most significant bit of the remainder (in the CF flag). 
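
The -9 example above can be reproduced with a short Python sketch. The helper names idiv_quotient_remainder and sar_quotient_remainder are hypothetical, and Python integers are used in place of fixed-width registers; Python's >> on a negative integer is an arithmetic shift, which makes it a convenient stand-in for SAR here.

```python
def idiv_quotient_remainder(dividend, divisor):
    """Signed division with the quotient rounded toward zero, as IDIV does."""
    quotient = abs(dividend) // abs(divisor)
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient
    return quotient, dividend - quotient * divisor


def sar_quotient_remainder(value, count):
    """Arithmetic right shift: the "quotient" rounds toward negative infinity.

    Returns (quotient, remainder, cf), where cf is the last bit shifted out.
    Assumes count >= 1.
    """
    quotient = value >> count
    remainder = value - (quotient << count)
    cf = (value >> (count - 1)) & 1
    return quotient, remainder, cf
```

For -9 divided by 4, the first helper yields a quotient of -2 with remainder -1, while shifting -9 right by two bits yields -3 with a "remainder" of +3 and CF = 1, the most significant bit of that two-bit remainder.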

The OF flag is affected only on 1-bit shifts. For left shifts, the OF flag is set to 0 if the most-significant bit of the 
result is the same as the CF flag (that is, the top two bits of the original operand were the same); otherwise, it is 
set to 1. For the SAR instruction, the OF flag is cleared for all 1-bit shifts. For the SHR instruction, the OF flag is set 
to the most-significant bit of the original operand. 

In 64-bit mode, the instruction's default operation size is 32 bits and the mask width for CL is 5 bits. Using a REX 
prefix in the form of REX.R permits access to additional registers (R8-R15). Using a REX prefix in the form of REX.W 
promotes operation to 64-bits and sets the mask width for CL to 6 bits. See the summary chart at the beginning of 
this section for encoding data and limits. 

IA-32 Architecture Compatibility 

The 8086 does not mask the shift count. However, all other IA-32 processors (starting with the Intel 286 processor) 
do mask the shift count to 5 bits, resulting in a maximum count of 31. This masking is done in all operating modes 
(including the virtual-8086 mode) to reduce the maximum execution time of the instructions. 

Operation 

IF 64-Bit Mode and using REX.W
    THEN
        countMASK ← 3FH;
    ELSE
        countMASK ← 1FH;
FI

tempCOUNT ← (COUNT AND countMASK);
tempDEST ← DEST;

WHILE (tempCOUNT ≠ 0)
DO
    IF instruction is SAL or SHL
        THEN
            CF ← MSB(DEST);
        ELSE (* Instruction is SAR or SHR *)
            CF ← LSB(DEST);
    FI;
    IF instruction is SAL or SHL
        THEN
            DEST ← DEST * 2;
        ELSE
            IF instruction is SAR
                THEN
                    DEST ← DEST / 2; (* Signed divide, rounding toward negative infinity *)
                ELSE (* Instruction is SHR *)
                    DEST ← DEST / 2; (* Unsigned divide *)
            FI;
    FI;
    tempCOUNT ← tempCOUNT - 1;
OD;

(* Determine overflow for the various instructions *)
IF (COUNT AND countMASK) = 1
    THEN
        IF instruction is SAL or SHL
            THEN
                OF ← MSB(DEST) XOR CF;
            ELSE
                IF instruction is SAR
                    THEN
                        OF ← 0;
                    ELSE (* Instruction is SHR *)
                        OF ← MSB(tempDEST);
                FI;
        FI;
    ELSE IF (COUNT AND countMASK) = 0
        THEN
            All flags unchanged;
        ELSE (* COUNT not 1 or 0 *)
            OF ← undefined;
    FI;
FI;
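
The pseudocode above can be transcribed into a runnable Python model. This is a sketch under assumptions: the function name shift and its parameters are hypothetical, a size of 64 is taken to imply REX.W, and None is used to represent a flag that is unchanged or undefined. SF, ZF, PF and AF are not modeled.

```python
def shift(op, dest, count, size=32):
    """Model SAL/SHL/SAR/SHR on a `size`-bit operand.

    Returns (result, cf, of). cf is None when the masked count is 0 (flags
    unchanged); of is None when it is unchanged or undefined.
    """
    mask = (1 << size) - 1
    count_mask = 0x3F if size == 64 else 0x1F
    temp_count = count & count_mask
    original = dest & mask          # tempDEST in the pseudocode
    dest = original
    cf = None
    for _ in range(temp_count):
        if op in ("SAL", "SHL"):
            cf = (dest >> (size - 1)) & 1      # MSB is shifted into CF
            dest = (dest << 1) & mask
        else:
            cf = dest & 1                      # LSB is shifted into CF
            sign = dest & (1 << (size - 1)) if op == "SAR" else 0
            dest = (dest >> 1) | sign          # SAR replicates the sign bit
    if temp_count == 1:
        if op in ("SAL", "SHL"):
            of = ((dest >> (size - 1)) & 1) ^ cf
        elif op == "SAR":
            of = 0
        else:                                  # SHR
            of = (original >> (size - 1)) & 1
    else:
        of = None
    return dest, cf, of
```

Note that the loop follows the pseudocode literally, so for 8-bit and 16-bit operands with a count at or above the operand size it still produces a CF value, whereas the "Flags Affected" text describes CF as undefined for SHL and SHR in that case.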

Flags Affected 

The CF flag contains the value of the last bit shifted out of the destination operand; it is undefined for SHL and SHR 
instructions where the count is greater than or equal to the size (in bits) of the destination operand. The OF flag is 
affected only for 1-bit shifts (see "Description" above); otherwise, it is undefined. The SF, ZF, and PF flags are set 
according to the result. If the count is 0, the flags are not affected. For a non-zero count, the AF flag is undefined. 

Protected Mode Exceptions 

#GP(0) If the destination is located in a non-writable segment. 

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

If the DS, ES, FS, or GS register contains a NULL segment selector. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used. 

Real-Address Mode Exceptions 

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#SS If a memory operand effective address is outside the SS segment limit. 

#UD If the LOCK prefix is used.

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made. 

#UD If the LOCK prefix is used. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the memory address is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used.

SARX/SHLX/SHRX - Shift Without Affecting Flags


Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
VEX.NDS.LZ.F3.0F38.W0 F7 /r SARX r32a, r/m32, r32b | RMV | V/V | BMI2 | Shift r/m32 arithmetically right with count specified in r32b.
VEX.NDS.LZ.66.0F38.W0 F7 /r SHLX r32a, r/m32, r32b | RMV | V/V | BMI2 | Shift r/m32 logically left with count specified in r32b.
VEX.NDS.LZ.F2.0F38.W0 F7 /r SHRX r32a, r/m32, r32b | RMV | V/V | BMI2 | Shift r/m32 logically right with count specified in r32b.
VEX.NDS.LZ.F3.0F38.W1 F7 /r SARX r64a, r/m64, r64b | RMV | V/N.E. | BMI2 | Shift r/m64 arithmetically right with count specified in r64b.
VEX.NDS.LZ.66.0F38.W1 F7 /r SHLX r64a, r/m64, r64b | RMV | V/N.E. | BMI2 | Shift r/m64 logically left with count specified in r64b.
VEX.NDS.LZ.F2.0F38.W1 F7 /r SHRX r64a, r/m64, r64b | RMV | V/N.E. | BMI2 | Shift r/m64 logically right with count specified in r64b.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RMV | ModRM:reg (w) | ModRM:r/m (r) | VEX.vvvv (r) | NA


Description 

Shifts the bits of the first source operand (the second operand) to the left or right by a COUNT value specified in
the second source operand (the third operand). The result is written to the destination operand (the first operand).

The shift arithmetic right (SARX) and shift logical right (SHRX) instructions shift the bits of the destination operand
to the right (toward less significant bit locations). SARX keeps and propagates the most significant bit (sign bit)
while shifting.

The logical shift left (SHLX) shifts the bits of the destination operand to the left (toward more significant bit
locations).

This instruction is not supported in real mode and virtual-8086 mode. The operand size is always 32 bits if not in
64-bit mode. In 64-bit mode operand size 64 requires VEX.W1. VEX.W1 is ignored in non-64-bit modes. An
attempt to execute this instruction with VEX.L not equal to 0 will cause #UD.

If the value specified in the second source operand exceeds OperandSize - 1, the COUNT value is masked.

SARX, SHRX, and SHLX instructions do not update flags.

Operation 

TEMP ← SRC1;
IF VEX.W1 and CS.L = 1
    THEN
        countMASK ← 3FH;
    ELSE
        countMASK ← 1FH;
FI
COUNT ← (SRC2 AND countMASK)
DEST[OperandSize-1] = TEMP[OperandSize-1];
DO WHILE (COUNT ≠ 0)
    IF instruction is SHLX
        THEN
            DEST[] ← DEST * 2;
    ELSE IF instruction is SHRX
        THEN
            DEST[] ← DEST / 2; // unsigned divide
    ELSE // SARX
            DEST[] ← DEST / 2; // signed divide, round toward negative infinity
    FI;
    COUNT ← COUNT - 1;
OD
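
The flagless shifts above can be modeled in Python as a sketch. The function name shift_x and its parameters are hypothetical; an operand size of 64 is assumed to correspond to VEX.W1 in 64-bit mode, and the result is returned as an unsigned value of that width.

```python
def shift_x(op, src1, src2, operand_size=32):
    """Model SARX/SHLX/SHRX: shift src1 by the masked count in src2, no flags."""
    mask = (1 << operand_size) - 1
    count = src2 & (0x3F if operand_size == 64 else 0x1F)
    value = src1 & mask
    if op == "SHLX":
        return (value << count) & mask
    if op == "SHRX":
        return value >> count
    if op == "SARX":
        if value >> (operand_size - 1):      # reinterpret as a signed integer
            value -= 1 << operand_size
        return (value >> count) & mask       # arithmetic shift, then re-wrap
    raise ValueError("unknown instruction: " + op)
```

Unlike the SAL/SAR/SHL/SHR model, nothing about CF or OF needs to be tracked, and the count comes from an arbitrary register rather than CL or an immediate.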

Flags Affected 

None. 

Intel C/C++ Compiler Intrinsic Equivalent 

Auto-generated from high-level language. 

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

See Section 2.5.1, "Exception Conditions for VEX-Encoded GPR Instructions", Table 2-29; additionally 
#UD If VEX.W = 1.

SBB—Integer Subtraction with Borrow


Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
1C ib | SBB AL, imm8 | I | Valid | Valid | Subtract with borrow imm8 from AL.
1D iw | SBB AX, imm16 | I | Valid | Valid | Subtract with borrow imm16 from AX.
1D id | SBB EAX, imm32 | I | Valid | Valid | Subtract with borrow imm32 from EAX.
REX.W + 1D id | SBB RAX, imm32 | I | Valid | N.E. | Subtract with borrow sign-extended imm32 to 64-bits from RAX.
80 /3 ib | SBB r/m8, imm8 | MI | Valid | Valid | Subtract with borrow imm8 from r/m8.
REX + 80 /3 ib | SBB r/m8*, imm8 | MI | Valid | N.E. | Subtract with borrow imm8 from r/m8.
81 /3 iw | SBB r/m16, imm16 | MI | Valid | Valid | Subtract with borrow imm16 from r/m16.
81 /3 id | SBB r/m32, imm32 | MI | Valid | Valid | Subtract with borrow imm32 from r/m32.
REX.W + 81 /3 id | SBB r/m64, imm32 | MI | Valid | N.E. | Subtract with borrow sign-extended imm32 to 64-bits from r/m64.
83 /3 ib | SBB r/m16, imm8 | MI | Valid | Valid | Subtract with borrow sign-extended imm8 from r/m16.
83 /3 ib | SBB r/m32, imm8 | MI | Valid | Valid | Subtract with borrow sign-extended imm8 from r/m32.
REX.W + 83 /3 ib | SBB r/m64, imm8 | MI | Valid | N.E. | Subtract with borrow sign-extended imm8 from r/m64.
18 /r | SBB r/m8, r8 | MR | Valid | Valid | Subtract with borrow r8 from r/m8.
REX + 18 /r | SBB r/m8*, r8 | MR | Valid | N.E. | Subtract with borrow r8 from r/m8.
19 /r | SBB r/m16, r16 | MR | Valid | Valid | Subtract with borrow r16 from r/m16.
19 /r | SBB r/m32, r32 | MR | Valid | Valid | Subtract with borrow r32 from r/m32.
REX.W + 19 /r | SBB r/m64, r64 | MR | Valid | N.E. | Subtract with borrow r64 from r/m64.
1A /r | SBB r8, r/m8 | RM | Valid | Valid | Subtract with borrow r/m8 from r8.
REX + 1A /r | SBB r8*, r/m8* | RM | Valid | N.E. | Subtract with borrow r/m8 from r8.
1B /r | SBB r16, r/m16 | RM | Valid | Valid | Subtract with borrow r/m16 from r16.
1B /r | SBB r32, r/m32 | RM | Valid | Valid | Subtract with borrow r/m32 from r32.
REX.W + 1B /r | SBB r64, r/m64 | RM | Valid | N.E. | Subtract with borrow r/m64 from r64.


NOTES: 

* In 64-bit mode, r/m8 can not be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
I | AL/AX/EAX/RAX | imm8/16/32 | NA | NA
MI | ModRM:r/m (w) | imm8/16/32 | NA | NA
MR | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA

Description

Adds the source operand (second operand) and the carry (CF) flag, and subtracts the result from the destination 
operand (first operand). The result of the subtraction is stored in the destination operand. The destination operand 
can be a register or a memory location; the source operand can be an immediate, a register, or a memory location. 
(However, two memory operands cannot be used in one instruction.) The state of the CF flag represents a borrow 
from a previous subtraction. 

When an immediate value is used as an operand, it is sign-extended to the length of the destination operand 
format. 

The SBB instruction does not distinguish between signed or unsigned operands. Instead, the processor evaluates 
the result for both data types and sets the OF and CF flags to indicate a borrow in the signed or unsigned result, 
respectively. The SF flag indicates the sign of the signed result. 

The SBB instruction is usually executed as part of a multibyte or multiword subtraction in which a SUB instruction 
is followed by a SBB instruction. 
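
A multiword subtraction of this kind can be sketched in Python. The names sbb and sub_multiword are hypothetical, limbs are stored least significant first, and only the CF (borrow) flag is modeled; with an incoming borrow of 0 the first step behaves like SUB.

```python
def sbb(dest, src, cf, size=32):
    """Compute DEST - (SRC + CF) on `size`-bit operands.

    Returns (result, borrow_out), where borrow_out models the CF flag.
    """
    mask = (1 << size) - 1
    raw = (dest & mask) - ((src & mask) + cf)
    return raw & mask, 1 if raw < 0 else 0


def sub_multiword(a_limbs, b_limbs, size=32):
    """Subtract two equal-length little-endian limb lists, chaining the borrow."""
    result = []
    borrow = 0                      # first limb: plain SUB (no incoming borrow)
    for a, b in zip(a_limbs, b_limbs):
        diff, borrow = sbb(a, b, borrow, size)
        result.append(diff)
    return result, borrow
```

For example, subtracting 1 from the two-limb value 0x1_00000000 borrows out of the low limb, and the SBB step on the high limb consumes that borrow to give 0x0_FFFFFFFF.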

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. 

In 64-bit mode, the instruction's default operation size is 32 bits. Using a REX prefix in the form of REX.R permits 
access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. See 
the summary chart at the beginning of this section for encoding data and limits. 

Operation 

DEST ← (DEST - (SRC + CF));

Intel C/C++ Compiler Intrinsic Equivalent 

SBB: extern unsigned char _subborrow_u8(unsigned char c_in, unsigned char src1, unsigned char src2, unsigned char *diff_out);

SBB: extern unsigned char _subborrow_u16(unsigned char c_in, unsigned short src1, unsigned short src2, unsigned short *diff_out);

SBB: extern unsigned char _subborrow_u32(unsigned char c_in, unsigned int src1, unsigned int src2, unsigned int *diff_out);

SBB: extern unsigned char _subborrow_u64(unsigned char c_in, unsigned __int64 src1, unsigned __int64 src2, unsigned __int64 *diff_out);

Flags Affected 

The OF, SF, ZF, AF, PF, and CF flags are set according to the result. 

Protected Mode Exceptions 

#GP(0) If the destination is located in a non-writable segment. 

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

If the DS, ES, FS, or GS register contains a NULL segment selector. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used but the destination is not a memory operand. 

Real-Address Mode Exceptions 

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#SS If a memory operand effective address is outside the SS segment limit. 

#UD If the LOCK prefix is used but the destination is not a memory operand.

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made.

#UD If the LOCK prefix is used but the destination is not a memory operand.


Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the memory address is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used but the destination is not a memory operand.

SCAS/SCASB/SCASW/SCASD-Scan String


Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
AE | SCAS m8 | NP | Valid | Valid | Compare AL with byte at ES:(E)DI or RDI, then set status flags.*
AF | SCAS m16 | NP | Valid | Valid | Compare AX with word at ES:(E)DI or RDI, then set status flags.*
AF | SCAS m32 | NP | Valid | Valid | Compare EAX with doubleword at ES:(E)DI or RDI then set status flags.*
REX.W + AF | SCAS m64 | NP | Valid | N.E. | Compare RAX with quadword at RDI or EDI then set status flags.
AE | SCASB | NP | Valid | Valid | Compare AL with byte at ES:(E)DI or RDI then set status flags.*
AF | SCASW | NP | Valid | Valid | Compare AX with word at ES:(E)DI or RDI then set status flags.*
AF | SCASD | NP | Valid | Valid | Compare EAX with doubleword at ES:(E)DI or RDI then set status flags.*
REX.W + AF | SCASQ | NP | Valid | N.E. | Compare RAX with quadword at RDI or EDI then set status flags.


NOTES: 

* In 64-bit mode, only 64-bit (RDI) and 32-bit (EDI) address sizes are supported. In non-64-bit mode, only 32-bit (EDI) and 16-bit (DI)
address sizes are supported.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
NP | NA | NA | NA | NA


Description 

In non-64-bit modes and in default 64-bit mode: this instruction compares a byte, word, doubleword or quadword 
specified using a memory operand with the value in AL, AX, or EAX. It then sets status flags in EFLAGS recording 
the results. The memory operand address is read from ES:(E)DI register (depending on the address-size attribute 
of the instruction and the current operational mode). Note that ES cannot be overridden with a segment override 
prefix. 

At the assembly-code level, two forms of this instruction are allowed: the explicit-operand form and the no-operands
form. The explicit-operand form (specified using the SCAS mnemonic) allows a memory operand to be specified
explicitly. The memory operand must be a symbol that indicates the size and location of the operand value. The
register operand is then automatically selected to match the size of the memory operand (AL register for byte
comparisons, AX for word comparisons, EAX for doubleword comparisons). The explicit-operand form is provided
to allow documentation. Note that the documentation provided by this form can be misleading. That is, the
memory operand symbol must specify the correct type (size) of the operand (byte, word, or doubleword) but it
does not have to specify the correct location. The location is always specified by ES:(E)DI.

The no-operands form of the instruction uses a short form of SCAS. Again, ES:(E)DI is assumed to be the memory 
operand and AL, AX, or EAX is assumed to be the register operand. The size of operands is selected by the 
mnemonic: SCASB (byte comparison), SCASW (word comparison), or SCASD (doubleword comparison). 

After the comparison, the (E)DI register is incremented or decremented automatically according to the setting of
the DF flag in the EFLAGS register. If the DF flag is 0, the (E)DI register is incremented; if the DF flag is 1, the (E)DI
register is decremented. The register is incremented or decremented by 1 for byte operations, by 2 for word
operations, and by 4 for doubleword operations.

SCAS, SCASB, SCASW, SCASD, and SCASQ can be preceded by the REP prefix for block comparisons of ECX bytes,
words, doublewords, or quadwords. Often, however, these instructions will be used in a LOOP construct that takes
some action based on the setting of status flags. See "REP/REPE/REPZ /REPNE/REPNZ—Repeat String Operation
Prefix" in this chapter for a description of the REP prefix.
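
A single SCAS step, and a repeated scan built on it, can be sketched in Python. This is an illustrative model only: the names scas and repne_scasb are hypothetical, memory is a bytes object indexed by the DI value, only ZF and CF are computed, and the loop assumes the usual REPNE behavior of decrementing the count each iteration and stopping when the count reaches 0 or ZF becomes 1.

```python
def scas(memory, di, accumulator, size=1, df=0):
    """One SCAS step: compare the accumulator with the operand at memory[di].

    Returns (new_di, zf, cf). `size` is the operand size in bytes.
    """
    mask = (1 << (8 * size)) - 1
    acc = accumulator & mask
    operand = int.from_bytes(memory[di:di + size], "little")
    temp = (acc - operand) & mask           # temp <- accumulator - SRC
    zf = 1 if temp == 0 else 0
    cf = 1 if acc < operand else 0          # unsigned borrow
    di = di - size if df else di + size     # DF = 0 increments, DF = 1 decrements
    return di, zf, cf


def repne_scasb(memory, di, al, cx):
    """Scan bytes until one equals AL or the count is exhausted (assumed REPNE loop)."""
    zf = 0
    while cx != 0:
        di, zf, _ = scas(memory, di, al, size=1)
        cx -= 1
        if zf:
            break
    return di, cx, zf
```

Scanning b"hello\x00world" for a zero byte with this model stops with DI one past the terminator, which is why string-length code built on SCASB typically subtracts 1 from the final index.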

In 64-bit mode, the instruction's default address size is 64-bits, 32-bit address size is supported using the prefix
67H. Using a REX prefix in the form of REX.W promotes operation on doubleword operand to 64 bits. The 64-bit
no-operand mnemonic is SCASQ. Address of the memory operand is specified in either RDI or EDI, and
AL/AX/EAX/RAX may be used as the register operand. After a comparison, the destination register is incremented
or decremented by the current operand size (depending on the value of the DF flag). See the summary chart at the
beginning of this section for encoding data and limits.

Operation 

Non-64-bit Mode:

IF (Byte comparison)
    THEN
        temp ← AL - SRC;
        SetStatusFlags(temp);
        IF DF = 0
            THEN (E)DI ← (E)DI + 1;
            ELSE (E)DI ← (E)DI - 1; FI;
    ELSE IF (Word comparison)
        THEN
            temp ← AX - SRC;
            SetStatusFlags(temp);
            IF DF = 0
                THEN (E)DI ← (E)DI + 2;
                ELSE (E)DI ← (E)DI - 2; FI;
        FI;
    ELSE IF (Doubleword comparison)
        THEN
            temp ← EAX - SRC;
            SetStatusFlags(temp);
            IF DF = 0
                THEN (E)DI ← (E)DI + 4;
                ELSE (E)DI ← (E)DI - 4; FI;
        FI;
FI;

64-bit Mode:

IF (Byte comparison)
    THEN
        temp ← AL - SRC;
        SetStatusFlags(temp);
        IF DF = 0
            THEN (R|E)DI ← (R|E)DI + 1;
            ELSE (R|E)DI ← (R|E)DI - 1; FI;
    ELSE IF (Word comparison)
        THEN
            temp ← AX - SRC;
            SetStatusFlags(temp);
            IF DF = 0
                THEN (R|E)DI ← (R|E)DI + 2;
                ELSE (R|E)DI ← (R|E)DI - 2; FI;
        FI;
    ELSE IF (Doubleword comparison)
        THEN
            temp ← EAX - SRC;
            SetStatusFlags(temp);
            IF DF = 0
                THEN (R|E)DI ← (R|E)DI + 4;
                ELSE (R|E)DI ← (R|E)DI - 4; FI;
        FI;
    ELSE IF (Quadword comparison using REX.W)
        THEN
            temp ← RAX - SRC;
            SetStatusFlags(temp);
            IF DF = 0
                THEN (R|E)DI ← (R|E)DI + 8;
                ELSE (R|E)DI ← (R|E)DI - 8;
            FI;
        FI;
FI;

Flags Affected 

The OF, SF, ZF, AF, PF, and CF flags are set according to the temporary result of the comparison. 

Protected Mode Exceptions 

#GP(0) If a memory operand effective address is outside the limit of the ES segment. 

If the ES register contains a NULL segment selector. 

If an illegal memory operand effective address in the ES segment is given. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used. 

Real-Address Mode Exceptions 

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#SS If a memory operand effective address is outside the SS segment limit. 

#UD If the LOCK prefix is used. 

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made. 

#UD If the LOCK prefix is used. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode.

64-Bit Mode Exceptions

#GP(0) If the memory address is in a non-canonical form. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used.

SETcc—Set Byte on Condition


Opcode        Instruction     Op/En  64-Bit Mode  Compat/Leg Mode  Description
0F 97         SETA r/m8       M      Valid        Valid            Set byte if above (CF=0 and ZF=0).
REX + 0F 97   SETA r/m8*      M      Valid        N.E.             Set byte if above (CF=0 and ZF=0).
0F 93         SETAE r/m8      M      Valid        Valid            Set byte if above or equal (CF=0).
REX + 0F 93   SETAE r/m8*     M      Valid        N.E.             Set byte if above or equal (CF=0).
0F 92         SETB r/m8       M      Valid        Valid            Set byte if below (CF=1).
REX + 0F 92   SETB r/m8*      M      Valid        N.E.             Set byte if below (CF=1).
0F 96         SETBE r/m8      M      Valid        Valid            Set byte if below or equal (CF=1 or ZF=1).
REX + 0F 96   SETBE r/m8*     M      Valid        N.E.             Set byte if below or equal (CF=1 or ZF=1).
0F 92         SETC r/m8       M      Valid        Valid            Set byte if carry (CF=1).
REX + 0F 92   SETC r/m8*      M      Valid        N.E.             Set byte if carry (CF=1).
0F 94         SETE r/m8       M      Valid        Valid            Set byte if equal (ZF=1).
REX + 0F 94   SETE r/m8*      M      Valid        N.E.             Set byte if equal (ZF=1).
0F 9F         SETG r/m8       M      Valid        Valid            Set byte if greater (ZF=0 and SF=OF).
REX + 0F 9F   SETG r/m8*      M      Valid        N.E.             Set byte if greater (ZF=0 and SF=OF).
0F 9D         SETGE r/m8      M      Valid        Valid            Set byte if greater or equal (SF=OF).
REX + 0F 9D   SETGE r/m8*     M      Valid        N.E.             Set byte if greater or equal (SF=OF).
0F 9C         SETL r/m8       M      Valid        Valid            Set byte if less (SF≠OF).
REX + 0F 9C   SETL r/m8*      M      Valid        N.E.             Set byte if less (SF≠OF).
0F 9E         SETLE r/m8      M      Valid        Valid            Set byte if less or equal (ZF=1 or SF≠OF).
REX + 0F 9E   SETLE r/m8*     M      Valid        N.E.             Set byte if less or equal (ZF=1 or SF≠OF).
0F 96         SETNA r/m8      M      Valid        Valid            Set byte if not above (CF=1 or ZF=1).
REX + 0F 96   SETNA r/m8*     M      Valid        N.E.             Set byte if not above (CF=1 or ZF=1).
0F 92         SETNAE r/m8     M      Valid        Valid            Set byte if not above or equal (CF=1).
REX + 0F 92   SETNAE r/m8*    M      Valid        N.E.             Set byte if not above or equal (CF=1).
0F 93         SETNB r/m8      M      Valid        Valid            Set byte if not below (CF=0).
REX + 0F 93   SETNB r/m8*     M      Valid        N.E.             Set byte if not below (CF=0).
0F 97         SETNBE r/m8     M      Valid        Valid            Set byte if not below or equal (CF=0 and ZF=0).
REX + 0F 97   SETNBE r/m8*    M      Valid        N.E.             Set byte if not below or equal (CF=0 and ZF=0).
0F 93         SETNC r/m8      M      Valid        Valid            Set byte if not carry (CF=0).
REX + 0F 93   SETNC r/m8*     M      Valid        N.E.             Set byte if not carry (CF=0).
0F 95         SETNE r/m8      M      Valid        Valid            Set byte if not equal (ZF=0).
REX + 0F 95   SETNE r/m8*     M      Valid        N.E.             Set byte if not equal (ZF=0).
0F 9E         SETNG r/m8      M      Valid        Valid            Set byte if not greater (ZF=1 or SF≠OF).
REX + 0F 9E   SETNG r/m8*     M      Valid        N.E.             Set byte if not greater (ZF=1 or SF≠OF).
0F 9C         SETNGE r/m8     M      Valid        Valid            Set byte if not greater or equal (SF≠OF).
REX + 0F 9C   SETNGE r/m8*    M      Valid        N.E.             Set byte if not greater or equal (SF≠OF).
0F 9D         SETNL r/m8      M      Valid        Valid            Set byte if not less (SF=OF).
REX + 0F 9D   SETNL r/m8*     M      Valid        N.E.             Set byte if not less (SF=OF).
0F 9F         SETNLE r/m8     M      Valid        Valid            Set byte if not less or equal (ZF=0 and SF=OF).
REX + 0F 9F   SETNLE r/m8*    M      Valid        N.E.             Set byte if not less or equal (ZF=0 and SF=OF).
0F 91         SETNO r/m8      M      Valid        Valid            Set byte if not overflow (OF=0).
REX + 0F 91   SETNO r/m8*     M      Valid        N.E.             Set byte if not overflow (OF=0).
0F 9B         SETNP r/m8      M      Valid        Valid            Set byte if not parity (PF=0).
REX + 0F 9B   SETNP r/m8*     M      Valid        N.E.             Set byte if not parity (PF=0).
0F 99         SETNS r/m8      M      Valid        Valid            Set byte if not sign (SF=0).
REX + 0F 99   SETNS r/m8*     M      Valid        N.E.             Set byte if not sign (SF=0).
0F 95         SETNZ r/m8      M      Valid        Valid            Set byte if not zero (ZF=0).
REX + 0F 95   SETNZ r/m8*     M      Valid        N.E.             Set byte if not zero (ZF=0).
0F 90         SETO r/m8       M      Valid        Valid            Set byte if overflow (OF=1).
REX + 0F 90   SETO r/m8*      M      Valid        N.E.             Set byte if overflow (OF=1).
0F 9A         SETP r/m8       M      Valid        Valid            Set byte if parity (PF=1).
REX + 0F 9A   SETP r/m8*      M      Valid        N.E.             Set byte if parity (PF=1).
0F 9A         SETPE r/m8      M      Valid        Valid            Set byte if parity even (PF=1).
REX + 0F 9A   SETPE r/m8*     M      Valid        N.E.             Set byte if parity even (PF=1).
0F 9B         SETPO r/m8      M      Valid        Valid            Set byte if parity odd (PF=0).
REX + 0F 9B   SETPO r/m8*     M      Valid        N.E.             Set byte if parity odd (PF=0).
0F 98         SETS r/m8       M      Valid        Valid            Set byte if sign (SF=1).
REX + 0F 98   SETS r/m8*      M      Valid        N.E.             Set byte if sign (SF=1).
0F 94         SETZ r/m8       M      Valid        Valid            Set byte if zero (ZF=1).
REX + 0F 94   SETZ r/m8*      M      Valid        N.E.             Set byte if zero (ZF=1).

NOTES: 

* In 64-bit mode, r/m8 cannot be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.


Instruction Operand Encoding 


Op/En   Operand 1       Operand 2   Operand 3   Operand 4
M       ModRM:r/m (r)   NA          NA          NA


Description 

Sets the destination operand to 0 or 1 depending on the settings of the status flags (CF, SF, OF, ZF, and PF) in the 
EFLAGS register. The destination operand points to a byte register or a byte in memory. The condition code suffix 
(cc) indicates the condition being tested for. 

The terms "above" and "below" are associated with the CF flag and refer to the relationship between two unsigned 
integer values. The terms "greater" and "less" are associated with the SF and OF flags and refer to the relationship 
between two signed integer values. 

Many of the SETcc instruction opcodes have alternate mnemonics. For example, SETG (set byte if greater) and 
SETNLE (set if not less or equal) have the same opcode and test for the same condition: ZF equals 0 and SF equals 
OF. These alternate mnemonics are provided to make code more intelligible. Appendix B, "EFLAGS Condition 
Codes," in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1, shows the alternate 
mnemonics for various test conditions. 

Some languages represent a logical one as an integer with all bits set. This representation can be obtained by 
choosing the logically opposite condition for the SETcc instruction, then decrementing the result. For example, to 
test for overflow, use the SETNO instruction, then decrement the result. 
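The condition tests and the "opposite condition, then decrement" idiom described above can be sketched in Python. This is a minimal model, not a definitive implementation: the names `CONDITIONS`, `ALIASES`, `setcc`, and `all_ones_mask` are hypothetical, flags are passed as a dictionary of 0/1 values, and only a few of the alternate mnemonics are listed.

```python
# Predicates for a subset of condition codes, taken from the opcode table.
CONDITIONS = {
    "A":  lambda f: f["CF"] == 0 and f["ZF"] == 0,
    "AE": lambda f: f["CF"] == 0,
    "B":  lambda f: f["CF"] == 1,
    "BE": lambda f: f["CF"] == 1 or f["ZF"] == 1,
    "E":  lambda f: f["ZF"] == 1,
    "NE": lambda f: f["ZF"] == 0,
    "G":  lambda f: f["ZF"] == 0 and f["SF"] == f["OF"],
    "GE": lambda f: f["SF"] == f["OF"],
    "L":  lambda f: f["SF"] != f["OF"],
    "LE": lambda f: f["ZF"] == 1 or f["SF"] != f["OF"],
    "O":  lambda f: f["OF"] == 1,
    "NO": lambda f: f["OF"] == 0,
    "P":  lambda f: f["PF"] == 1,
    "NP": lambda f: f["PF"] == 0,
    "S":  lambda f: f["SF"] == 1,
    "NS": lambda f: f["SF"] == 0,
}

# A few alternate mnemonics that share an opcode with an entry above.
ALIASES = {"NBE": "A", "NLE": "G", "NGE": "L", "Z": "E", "NZ": "NE", "C": "B"}


def setcc(cc, flags):
    """Return the byte value (0 or 1) that SETcc would store."""
    cc = ALIASES.get(cc, cc)
    return 1 if CONDITIONS[cc](flags) else 0


def all_ones_mask(opposite_cc, flags):
    """Test the opposite condition, then decrement the byte.

    The result is 0xFF when the intended condition is true and 0x00 otherwise.
    """
    return (setcc(opposite_cc, flags) - 1) & 0xFF
```

For example, to obtain an all-ones byte on overflow, the sketch tests "NO" and decrements: with OF = 1 the stored 0 wraps to 0xFF, and with OF = 0 the stored 1 becomes 0x00.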



In 64-bit mode, the operand size is fixed at 8 bits. Use of a REX prefix enables uniform addressing of additional byte
registers. Otherwise, this instruction's operation is the same as in legacy mode and compatibility mode.

Operation 

IF condition
    THEN DEST ← 1;
    ELSE DEST ← 0;
FI;

Flags Affected 

None. 

Protected Mode Exceptions 

#GP(0)            If the destination is located in a non-writable segment.
                  If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
                  If the DS, ES, FS, or GS register contains a NULL segment selector.
#SS(0)            If a memory operand effective address is outside the SS segment limit.
#PF(fault-code)   If a page fault occurs.
#UD               If the LOCK prefix is used.


Real-Address Mode Exceptions

#GP               If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS               If a memory operand effective address is outside the SS segment limit.
#UD               If the LOCK prefix is used.


Virtual-8086 Mode Exceptions

#GP(0)            If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS(0)            If a memory operand effective address is outside the SS segment limit.
#PF(fault-code)   If a page fault occurs.
#UD               If the LOCK prefix is used.


Compatibility Mode Exceptions 

Same exceptions as in protected mode. 


64-Bit Mode Exceptions

#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
#PF(fault-code)   If a page fault occurs.
#UD               If the LOCK prefix is used.

SFENCE—Store Fence


Opcode*    Instruction   Op/En   64-Bit Mode   Compat/Leg Mode   Description
0F AE F8   SFENCE        NP      Valid         Valid             Serializes store operations.


Instruction Operand Encoding

Op/En   Operand 1   Operand 2   Operand 3   Operand 4
NP      NA          NA          NA          NA


Description 

Performs a serializing operation on all store-to-memory instructions that were issued prior to the SFENCE instruction.
This serializing operation guarantees that every store instruction that precedes the SFENCE instruction in program 
order becomes globally visible before any store instruction that follows the SFENCE instruction. The SFENCE 
instruction is ordered with respect to store instructions, other SFENCE instructions, any LFENCE and MFENCE 
instructions, and any serializing instructions (such as the CPUID instruction). It is not ordered with respect to load 
instructions. 

Weakly ordered memory types can be used to achieve higher processor performance through such techniques as 
out-of-order issue, write-combining, and write-collapsing. The degree to which a consumer of data recognizes or 
knows that the data is weakly ordered varies among applications and may be unknown to the producer of this data. 
The SFENCE instruction provides a performance-efficient way of ensuring store ordering between routines that 
produce weakly-ordered results and routines that consume this data. 

This instruction's operation is the same in non-64-bit modes and 64-bit mode. 

Specification of the instruction's opcode above indicates a ModR/M byte of F8. For this instruction, the processor
ignores the r/m field of the ModR/M byte. Thus, SFENCE is encoded by any opcode of the form 0F AE Fx, where x
is in the range 8-F.

Operation 

Wait_On_Following_Stores_Until(preceding_stores_globally_visible); 

Intel C/C++ Compiler Intrinsic Equivalent 

void _mm_sfence(void) 

Exceptions (All Operating Modes) 

#UD If CPUID.01H:EDX.SSE[bit 25] = 0. 

If the LOCK prefix is used. 


SGDT—Store Global Descriptor Table Register


Opcode*    Instruction   Op/En   64-Bit Mode   Compat/Leg Mode   Description
0F 01 /0   SGDT m        M       Valid         Valid             Store GDTR to m.

NOTES:

* See IA-32 Architecture Compatibility section below.


Instruction Operand Encoding

Op/En   Operand 1       Operand 2   Operand 3   Operand 4
M       ModRM:r/m (w)   NA          NA          NA


Description 

Stores the content of the global descriptor table register (GDTR) in the destination operand. The destination 
operand specifies a memory location. 

In legacy or compatibility mode, the destination operand is a 6-byte memory location. If the operand-size attribute
is 16 bits, the limit is stored in the low 2 bytes, the 24-bit base address is stored in bytes 3-5, and byte 6 is
zero-filled. If the operand-size attribute is 32 bits, the 16-bit limit field of the register is stored in the low 2
bytes of the memory location and the 32-bit base address is stored in the high 4 bytes.

In IA-32e mode, the operand size is fixed at 8+2 bytes. The instruction stores an 8-byte base and a 2-byte limit.

SGDT is useful only by operating-system software. However, it can be used in application programs without causing 
an exception to be generated if CR4.UMIP = 0. See "LGDT/LIDT—Load Global/Interrupt Descriptor Table Register" 
in Chapter 3, Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2A, for information on 
loading the GDTR and IDTR. 

IA-32 Architecture Compatibility 

The 16-bit form of the SGDT is compatible with the Intel 286 processor if the upper 8 bits are not referenced. The
Intel 286 processor fills these bits with 1s; processor generations later than the Intel 286 processor fill these bits
with 0s.

Operation 

IF instruction is SGDT
    IF OperandSize = 16
        THEN
            DEST[0:15] ← GDTR(Limit);
            DEST[16:39] ← GDTR(Base); (* 24 bits of base address stored *)
            DEST[40:47] ← 0;
        ELSE IF (32-bit Operand Size)
            DEST[0:15] ← GDTR(Limit);
            DEST[16:47] ← GDTR(Base); (* Full 32-bit base address stored *)
            FI;
        ELSE (* 64-bit Operand Size *)
            DEST[0:15] ← GDTR(Limit);
            DEST[16:79] ← GDTR(Base); (* Full 64-bit base address stored *)
    FI;
FI;
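The memory image described by the Operation above can be sketched with Python's standard `struct` module. The helper name `sgdt_image` and its arguments are hypothetical; the sketch only models the byte layout as this section describes it (little-endian limit followed by base) and does not read any real register.

```python
import struct


def sgdt_image(limit, base, operand_size):
    """Return the bytes SGDT would store for a given limit, base, and operand size."""
    limit &= 0xFFFF
    if operand_size == 16:
        # 24 bits of base are stored; the sixth byte is zero-filled.
        return struct.pack("<HI", limit, base & 0xFFFFFF)
    if operand_size == 32:
        # Full 32-bit base address stored.
        return struct.pack("<HI", limit, base & 0xFFFFFFFF)
    if operand_size == 64:
        # 2-byte limit followed by an 8-byte base (10 bytes total).
        return struct.pack("<HQ", limit, base & 0xFFFFFFFFFFFFFFFF)
    raise ValueError("operand_size must be 16, 32, or 64")
```

With a limit of 0x0027 and a base of 0x12345678, the 16-bit form yields the bytes 27 00 78 56 34 00, while the 32-bit form keeps the top base byte (27 00 78 56 34 12).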

Flags Affected 

None.

Protected Mode Exceptions

#UD If the destination operand is a register. 

If the LOCK prefix is used. 

#GP(0) If the destination is located in a non-writable segment. 

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 
If the DS, ES, FS, or GS register is used to access memory and it contains a NULL segment 
selector. 

If CR4.UMIP = 1 and CPL > 0.


#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while CPL = 3.


Real-Address Mode Exceptions 

#UD If the destination operand is a register. 

If the LOCK prefix is used. 

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.


Virtual-8086 Mode Exceptions

#UD If the destination operand is a register. 

If the LOCK prefix is used. 

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

If CR4.UMIP = 1. 


#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made. 


Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions 

#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 

#UD If the destination operand is a register. 

If the LOCK prefix is used. 

#GP(0) If the memory address is in a non-canonical form. 

If CR4.UMIP = 1 and CPL > 0. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while CPL = 3. 


SHA1RNDS4—Perform Four Rounds of SHA1 Operation


Opcode/Instruction                Op/En   64/32 bit Mode Support   CPUID Feature Flag
0F 3A CC /r ib                    RMI     V/V                      SHA
SHA1RNDS4 xmm1, xmm2/m128, imm8

Description: Performs four rounds of SHA1 operation operating on SHA1 state (A,B,C,D) from xmm1, with a
pre-computed sum of the next 4 round message dwords and state variable E from xmm2/m128. The immediate
byte controls logic functions and round constants.


Instruction Operand Encoding

Op/En   Operand 1          Operand 2       Operand 3
RMI     ModRM:reg (r, w)   ModRM:r/m (r)   imm8


Description 

The SHA1RNDS4 instruction performs four rounds of SHA1 operation using an initial SHA1 state (A,B,C,D) from the
first operand (which is a source operand and the destination operand) and some pre-computed sum of the next 4
round message dwords, and state variable E from the second operand (a source operand). The updated SHA1 state
(A,B,C,D) after four rounds of processing is stored in the destination operand.

Operation 

SHA1RNDS4

The function f() and Constant K are dependent on the value of the immediate.

IF (imm8[1:0] = 0)
    THEN f() ← f0(), K ← K0;
ELSE IF (imm8[1:0] = 1)
    THEN f() ← f1(), K ← K1;
ELSE IF (imm8[1:0] = 2)
    THEN f() ← f2(), K ← K2;
ELSE IF (imm8[1:0] = 3)
    THEN f() ← f3(), K ← K3;
FI;

A ← SRC1[127:96];
B ← SRC1[95:64];
C ← SRC1[63:32];
D ← SRC1[31:0];
W0E ← SRC2[127:96];
W1 ← SRC2[95:64];
W2 ← SRC2[63:32];
W3 ← SRC2[31:0];

Round i = 0 operation:

A_1 ← f(B, C, D) + (A ROL 5) + W0E + K;
B_1 ← A;
C_1 ← B ROL 30;
D_1 ← C;
E_1 ← D;

FOR i = 1 to 3
    A_(i+1) ← f(B_i, C_i, D_i) + (A_i ROL 5) + Wi + E_i + K;
    B_(i+1) ← A_i;
    C_(i+1) ← B_i ROL 30;
    D_(i+1) ← C_i;
    E_(i+1) ← D_i;
ENDFOR

DEST[127:96] ← A_4;
DEST[95:64] ← B_4;
DEST[63:32] ← C_4;
DEST[31:0] ← D_4;

Intel C/C++ Compiler Intrinsic Equivalent 

SHA1RNDS4: __m128i _mm_sha1rnds4_epu32(__m128i, __m128i, const int);

Flags Affected 

None 

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

See Exceptions Type 4.

SHA1NEXTE—Calculate SHA1 State Variable E after Four Rounds


Opcode/Instruction          Op/En   64/32 bit Mode Support   CPUID Feature Flag
0F 38 C8 /r                 RM      V/V                      SHA
SHA1NEXTE xmm1, xmm2/m128

Description: Calculates SHA1 state variable E after four rounds of operation from the current SHA1 state
variable A in xmm1. The calculated value of the SHA1 state variable E is added to the scheduled dwords in
xmm2/m128, and stored with some of the scheduled dwords in xmm1.


Instruction Operand Encoding

Op/En   Operand 1          Operand 2       Operand 3
RM      ModRM:reg (r, w)   ModRM:r/m (r)   NA


Description 

The SHA1NEXTE instruction calculates the SHA1 state variable E after four rounds of operation from the current SHA1
state variable A in the destination operand. The calculated value of the SHA1 state variable E is added to the source
operand, which contains the scheduled dwords.

Operation 

SHA1NEXTE

TMP ← (SRC1[127:96] ROL 30);
DEST[127:96] ← SRC2[127:96] + TMP;
DEST[95:64] ← SRC2[95:64];
DEST[63:32] ← SRC2[63:32];
DEST[31:0] ← SRC2[31:0];
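To show how the round instruction and SHA1NEXTE fit together, the following Python sketch models both operations on tuples of four dwords (most significant dword first) and uses them to hash one padded block. All function names are hypothetical. The logic functions f0 to f3, the constants K0 to K3, the initial hash values, and the padding rule are taken from FIPS 180-4 and are assumptions as far as this section is concerned, since they are not spelled out here.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF


def rol(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK


# f0..f3 and K0..K3 per FIPS 180-4 (assumed; not defined in this section).
F = (
    lambda b, c, d: (b & c) ^ (~b & d),
    lambda b, c, d: b ^ c ^ d,
    lambda b, c, d: (b & c) ^ (b & d) ^ (c & d),
    lambda b, c, d: b ^ c ^ d,
)
K = (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)


def sha1rnds4(src1, src2, imm8):
    """Four rounds: src1 = (A, B, C, D), src2 = (W0 + E, W1, W2, W3)."""
    f, k = F[imm8 & 3], K[imm8 & 3]
    a, b, c, d = src1
    e = 0  # E for the first round is already folded into src2[0]
    for w in src2:
        a, b, c, d, e = (f(b, c, d) + rol(a, 5) + w + e + k) & MASK, a, rol(b, 30), c, d
    return (a, b, c, d)


def sha1nexte(src1, src2):
    """Add (A ROL 30) from src1 to the top dword of the scheduled dwords in src2."""
    return ((src2[0] + rol(src1[0], 30)) & MASK,) + tuple(src2[1:])


def sha1_single_block(msg):
    """Hash a message short enough (at most 55 bytes) to fit one padded block."""
    if len(msg) > 55:
        raise ValueError("sketch handles a single block only")
    block = msg + b"\x80" + b"\x00" * (55 - len(msg)) + struct.pack(">Q", 8 * len(msg))
    w = list(struct.unpack(">16I", block))
    for t in range(16, 80):
        w.append(rol(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    h = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)
    abcd, prev = h[:4], None
    for g in range(20):
        msg4 = tuple(w[4 * g:4 * g + 4])
        if g == 0:
            src2 = ((msg4[0] + h[4]) & MASK,) + msg4[1:]
        else:
            src2 = sha1nexte(prev, msg4)  # prev = state before the previous 4 rounds
        prev, abcd = abcd, sha1rnds4(abcd, src2, g // 5)
    e = rol(prev[0], 30)
    return struct.pack(">5I", *[(x + y) & MASK for x, y in zip(h, abcd + (e,))])


def matches_hashlib(msg):
    return sha1_single_block(msg) == hashlib.sha1(msg).digest()
```

The design point the sketch illustrates is that E never needs its own register lane: after four rounds, E equals the A value from four rounds earlier rotated left by 30, which is exactly what SHA1NEXTE adds into the top message dword.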

Intel C/C++ Compiler Intrinsic Equivalent 

SHA1NEXTE: __m128i _mm_sha1nexte_epu32(__m128i, __m128i);

Flags Affected 
None 

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

See Exceptions Type 4.

SHA1MSG1—Perform an Intermediate Calculation for the Next Four SHA1 Message Dwords


Opcode/Instruction         Op/En   64/32 bit Mode Support   CPUID Feature Flag
0F 38 C9 /r                RM      V/V                      SHA
SHA1MSG1 xmm1, xmm2/m128

Description: Performs an intermediate calculation for the next four SHA1 message dwords using previous
message dwords from xmm1 and xmm2/m128, storing the result in xmm1.


Instruction Operand Encoding

Op/En   Operand 1          Operand 2       Operand 3
RM      ModRM:reg (r, w)   ModRM:r/m (r)   NA


Description 

The SHA1MSG1 instruction is one of two SHA1 message scheduling instructions. The instruction performs an
intermediate calculation for the next four SHA1 message dwords.

Operation 

SHA1MSG1

W0 ← SRC1[127:96];
W1 ← SRC1[95:64];
W2 ← SRC1[63:32];
W3 ← SRC1[31:0];
W4 ← SRC2[127:96];
W5 ← SRC2[95:64];

DEST[127:96] ← W2 XOR W0;
DEST[95:64] ← W3 XOR W1;
DEST[63:32] ← W4 XOR W2;
DEST[31:0] ← W5 XOR W3;

Intel C/C++ Compiler Intrinsic Equivalent 

SHA1MSG1: __m128i _mm_sha1msg1_epu32(__m128i, __m128i);

Flags Affected 

None 

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

See Exceptions Type 4.

SHA1MSG2—Perform a Final Calculation for the Next Four SHA1 Message Dwords


Opcode/Instruction         Op/En   64/32 bit Mode Support   CPUID Feature Flag
0F 38 CA /r                RM      V/V                      SHA
SHA1MSG2 xmm1, xmm2/m128

Description: Performs the final calculation for the next four SHA1 message dwords using intermediate
results from xmm1 and the previous message dwords from xmm2/m128, storing the result in xmm1.


Instruction Operand Encoding

Op/En   Operand 1          Operand 2       Operand 3
RM      ModRM:reg (r, w)   ModRM:r/m (r)   NA


Description 

The SHA1MSG2 instruction is one of two SHA1 message scheduling instructions. The instruction performs the final
calculation to derive the next four SHA1 message dwords.

Operation 

SHA1MSG2

W13 ← SRC2[95:64];
W14 ← SRC2[63:32];
W15 ← SRC2[31:0];
W16 ← (SRC1[127:96] XOR W13) ROL 1;
W17 ← (SRC1[95:64] XOR W14) ROL 1;
W18 ← (SRC1[63:32] XOR W15) ROL 1;
W19 ← (SRC1[31:0] XOR W16) ROL 1;

DEST[127:96] ← W16;
DEST[95:64] ← W17;
DEST[63:32] ← W18;
DEST[31:0] ← W19;
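The two message scheduling operations can be checked against the SHA-1 schedule recurrence W[t] = (W[t-3] XOR W[t-8] XOR W[t-14] XOR W[t-16]) ROL 1 from FIPS 180-4. The Python sketch below is a model under stated assumptions: registers are tuples of four dwords with the most significant dword first, W0 is assumed to sit in the most significant dword of the first register, and the XOR with the third group of words (done by a separate instruction in real code) is written out explicitly. The names `next_four`, `reference_next_four`, and `SAMPLE` are hypothetical.

```python
MASK = 0xFFFFFFFF


def rol(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK


def sha1msg1(src1, src2):
    """Intermediate calculation: src1 = (W0..W3), src2 = (W4..W7)."""
    w0, w1, w2, w3 = src1
    w4, w5 = src2[0], src2[1]
    return (w2 ^ w0, w3 ^ w1, w4 ^ w2, w5 ^ w3)


def sha1msg2(src1, src2):
    """Final calculation: src1 = intermediate result, src2 = (W12..W15)."""
    w13, w14, w15 = src2[1], src2[2], src2[3]
    w16 = rol(src1[0] ^ w13, 1)
    w17 = rol(src1[1] ^ w14, 1)
    w18 = rol(src1[2] ^ w15, 1)
    w19 = rol(src1[3] ^ w16, 1)  # depends on W16 computed just above
    return (w16, w17, w18, w19)


def next_four(w):
    """Derive W16..W19 from W0..W15 using the two modeled instructions."""
    m0, m1, m2, m3 = (tuple(w[i:i + 4]) for i in range(0, 16, 4))
    t = sha1msg1(m0, m1)
    t = tuple(x ^ y for x, y in zip(t, m2))  # XOR in W8..W11
    return sha1msg2(t, m3)


def reference_next_four(w):
    """Same four words computed directly from the schedule recurrence."""
    w = list(w)
    for t in range(16, 20):
        w.append(rol(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    return tuple(w[16:20])


# Arbitrary sample words, used only to exercise the two paths.
SAMPLE = [(i * 0x9E3779B1 + 1) & MASK for i in range(16)]
```

Calling `next_four(SAMPLE)` and `reference_next_four(SAMPLE)` should yield the same four words, which is the property that lets the pair of instructions replace the scalar recurrence.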

Intel C/C++ Compiler Intrinsic Equivalent 

SHA1MSG2: __m128i _mm_sha1msg2_epu32(__m128i, __m128i);

Flags Affected 

None 

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

See Exceptions Type 4.

SHA256RNDS2—Perform Two Rounds of SHA256 Operation


Opcode/Instruction                    Op/En   64/32 bit Mode Support   CPUID Feature Flag
0F 38 CB /r                           RMI     V/V                      SHA
SHA256RNDS2 xmm1, xmm2/m128, <XMM0>

Description: Perform 2 rounds of SHA256 operation using an initial SHA256 state (C,D,G,H) from xmm1, an
initial SHA256 state (A,B,E,F) from xmm2/m128, and a pre-computed sum of the next 2 round message dwords
and the corresponding round constants from the implicit operand XMM0, storing the updated SHA256 state
(A,B,E,F) result in xmm1.


Instruction Operand Encoding

Op/En   Operand 1          Operand 2       Operand 3
RMI     ModRM:reg (r, w)   ModRM:r/m (r)   Implicit XMM0 (r)


Description 

The SHA256RNDS2 instruction performs 2 rounds of SHA256 operation using an initial SHA256 state (C,D,G,H)
from the first operand, an initial SHA256 state (A,B,E,F) from the second operand, and a pre-computed sum of the
next 2 round message dwords and the corresponding round constants from the implicit operand XMM0. Note that
only the two lower dwords of XMM0 are used by the instruction.

The updated SHA256 state (A,B,E,F) is written to the first operand, and the second operand can be used as the
updated state (C,D,G,H) in later rounds.

Operation 

SHA256RNDS2

A_0 ← SRC2[127:96];
B_0 ← SRC2[95:64];
C_0 ← SRC1[127:96];
D_0 ← SRC1[95:64];
E_0 ← SRC2[63:32];
F_0 ← SRC2[31:0];
G_0 ← SRC1[63:32];
H_0 ← SRC1[31:0];
WK0 ← XMM0[31:0];
WK1 ← XMM0[63:32];

FOR i = 0 to 1
    A_(i+1) ← Ch(E_i, F_i, G_i) + Σ1(E_i) + WKi + H_i + Maj(A_i, B_i, C_i) + Σ0(A_i);
    B_(i+1) ← A_i;
    C_(i+1) ← B_i;
    D_(i+1) ← C_i;
    E_(i+1) ← Ch(E_i, F_i, G_i) + Σ1(E_i) + WKi + H_i + D_i;
    F_(i+1) ← E_i;
    G_(i+1) ← F_i;
    H_(i+1) ← G_i;
ENDFOR

DEST[127:96] ← A_2;
DEST[95:64] ← B_2;
DEST[63:32] ← E_2;
DEST[31:0] ← F_2;
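The two-round operation can be modeled in a few lines of Python. This is a sketch under stated assumptions: the definitions of Ch, Maj, Σ0, and Σ1 come from FIPS 180-4 rather than from this section, the function names are hypothetical, and the two low dwords of XMM0 are passed as the separate arguments `wk0` and `wk1` instead of as a register value.

```python
MASK = 0xFFFFFFFF


def ror(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


# Logic functions per FIPS 180-4 (assumed; not defined in this section).
def ch(e, f, g):
    return (e & f) ^ (~e & g)


def maj(a, b, c):
    return (a & b) ^ (a & c) ^ (b & c)


def big_sigma0(a):
    return ror(a, 2) ^ ror(a, 13) ^ ror(a, 22)


def big_sigma1(e):
    return ror(e, 6) ^ ror(e, 11) ^ ror(e, 25)


def sha256rnds2(src1, src2, wk0, wk1):
    """Two rounds: src1 = (C, D, G, H), src2 = (A, B, E, F); returns new (A, B, E, F)."""
    c, d, g, h = src1
    a, b, e, f = src2
    for wk in (wk0, wk1):
        t1 = (ch(e, f, g) + big_sigma1(e) + wk + h) & MASK
        a, b, c, d, e, f, g, h = (
            (t1 + maj(a, b, c) + big_sigma0(a)) & MASK, a, b, c,
            (t1 + d) & MASK, e, f, g,
        )
    return (a, b, e, f)
```

After two rounds the new (C, D, G, H) equals the old (A, B, E, F), which is why the unmodified second operand can serve as the (C, D, G, H) input of the next invocation.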


Intel C/C++ Compiler Intrinsic Equivalent

SHA256RNDS2: __m128i _mm_sha256rnds2_epu32(__m128i, __m128i, __m128i);

Flags Affected 

None 

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

See Exceptions Type 4.

SHA256MSG1—Perform an Intermediate Calculation for the Next Four SHA256 Message Dwords


Opcode/Instruction           Op/En   64/32 bit Mode Support   CPUID Feature Flag
0F 38 CC /r                  RM      V/V                      SHA
SHA256MSG1 xmm1, xmm2/m128

Description: Performs an intermediate calculation for the next four SHA256 message dwords using previous
message dwords from xmm1 and xmm2/m128, storing the result in xmm1.


Instruction Operand Encoding

Op/En   Operand 1          Operand 2       Operand 3
RM      ModRM:reg (r, w)   ModRM:r/m (r)   NA


Description 

The SHA256MSG1 instruction is one of two SHA256 message scheduling instructions. The instruction performs an 
intermediate calculation for the next four SHA256 message dwords. 

Operation 

SHA256MSG1

W4 ← SRC2[31:0];
W3 ← SRC1[127:96];
W2 ← SRC1[95:64];
W1 ← SRC1[63:32];
W0 ← SRC1[31:0];

DEST[127:96] ← W3 + σ0(W4);
DEST[95:64] ← W2 + σ0(W3);
DEST[63:32] ← W1 + σ0(W2);
DEST[31:0] ← W0 + σ0(W1);

Intel C/C++ Compiler Intrinsic Equivalent 

SHA256MSG1: __m128i _mm_sha256msg1_epu32(__m128i, __m128i);

Flags Affected 

None 

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

See Exceptions Type 4.

SHA256MSG2—Perform a Final Calculation for the Next Four SHA256 Message Dwords


Opcode/Instruction           Op/En   64/32 bit Mode Support   CPUID Feature Flag
0F 38 CD /r                  RM      V/V                      SHA
SHA256MSG2 xmm1, xmm2/m128

Description: Performs the final calculation for the next four SHA256 message dwords using previous message
dwords from xmm1 and xmm2/m128, storing the result in xmm1.


Instruction Operand Encoding

Op/En   Operand 1          Operand 2       Operand 3
RM      ModRM:reg (r, w)   ModRM:r/m (r)   NA


Description 

The SHA256MSG2 instruction is one of two SHA2 message scheduling instructions. The instruction performs the 
final calculation for the next four SHA256 message dwords. 

Operation 

SHA256MSG2

W14 ← SRC2[95:64];
W15 ← SRC2[127:96];
W16 ← SRC1[31:0] + σ1(W14);
W17 ← SRC1[63:32] + σ1(W15);
W18 ← SRC1[95:64] + σ1(W16);
W19 ← SRC1[127:96] + σ1(W17);

DEST[127:96] ← W19;
DEST[95:64] ← W18;
DEST[63:32] ← W17;
DEST[31:0] ← W16;
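The SHA256 message scheduling pair can be checked against the schedule recurrence W[t] = σ1(W[t-2]) + W[t-7] + σ0(W[t-15]) + W[t-16] from FIPS 180-4. In the Python sketch below, registers are lists indexed by lane, with lane 0 holding bits [31:0]; the σ0 and σ1 definitions are taken from FIPS 180-4, and the addition of W9..W12 between the two steps (done by separate instructions in real code) is written out explicitly. The names `next_four`, `reference_next_four`, and `SAMPLE` are hypothetical.

```python
MASK = 0xFFFFFFFF


def ror(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


# Small sigma functions per FIPS 180-4 (assumed; not defined in this section).
def sigma0(x):
    return ror(x, 7) ^ ror(x, 18) ^ (x >> 3)


def sigma1(x):
    return ror(x, 17) ^ ror(x, 19) ^ (x >> 10)


def sha256msg1(src1, src2):
    """src1 lanes = W0..W3, src2 lane 0 = W4; returns W[i] + sigma0(W[i+1])."""
    w = list(src1) + [src2[0]]
    return [(w[i] + sigma0(w[i + 1])) & MASK for i in range(4)]


def sha256msg2(src1, src2):
    """src1 = partial sums, src2 lanes = W12..W15; returns lanes W16..W19."""
    w14, w15 = src2[2], src2[3]
    w16 = (src1[0] + sigma1(w14)) & MASK
    w17 = (src1[1] + sigma1(w15)) & MASK
    w18 = (src1[2] + sigma1(w16)) & MASK  # depends on W16 computed above
    w19 = (src1[3] + sigma1(w17)) & MASK  # depends on W17 computed above
    return [w16, w17, w18, w19]


def next_four(w):
    """Derive W16..W19 from W0..W15 using the two modeled instructions."""
    t = sha256msg1(w[0:4], w[4:8])
    t = [(t[i] + w[9 + i]) & MASK for i in range(4)]  # add W9..W12
    return sha256msg2(t, w[12:16])


def reference_next_four(w):
    """Same four words computed directly from the schedule recurrence."""
    w = list(w)
    for t in range(16, 20):
        w.append((sigma1(w[t - 2]) + w[t - 7] + sigma0(w[t - 15]) + w[t - 16]) & MASK)
    return w[16:20]


# Arbitrary sample words, used only to exercise the two paths.
SAMPLE = [(i * 0x9E3779B1 + 1) & MASK for i in range(16)]
```

Both paths should agree on `SAMPLE`; the serial dependency of W18 and W19 on W16 and W17 is the reason the final step is a dedicated instruction rather than a plain vector add.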

Intel C/C++ Compiler Intrinsic Equivalent 

SHA256MSG2: __m128i _mm_sha256msg2_epu32(__m128i, __m128i);

Flags Affected 

None 

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

See Exceptions Type 4.

SHLD—Double Precision Shift Left


Opcode*               Instruction             Op/En  64-Bit Mode  Compat/Leg Mode  Description
0F A4 /r ib           SHLD r/m16, r16, imm8   MRI    Valid        Valid            Shift r/m16 to left imm8 places while shifting bits from r16 in from the right.
0F A5 /r              SHLD r/m16, r16, CL     MRC    Valid        Valid            Shift r/m16 to left CL places while shifting bits from r16 in from the right.
0F A4 /r ib           SHLD r/m32, r32, imm8   MRI    Valid        Valid            Shift r/m32 to left imm8 places while shifting bits from r32 in from the right.
REX.W + 0F A4 /r ib   SHLD r/m64, r64, imm8   MRI    Valid        N.E.             Shift r/m64 to left imm8 places while shifting bits from r64 in from the right.
0F A5 /r              SHLD r/m32, r32, CL     MRC    Valid        Valid            Shift r/m32 to left CL places while shifting bits from r32 in from the right.
REX.W + 0F A5 /r      SHLD r/m64, r64, CL     MRC    Valid        N.E.             Shift r/m64 to left CL places while shifting bits from r64 in from the right.


Instruction Operand Encoding

Op/En   Operand 1       Operand 2       Operand 3   Operand 4
MRI     ModRM:r/m (w)   ModRM:reg (r)   imm8        NA
MRC     ModRM:r/m (w)   ModRM:reg (r)   CL          NA


Description 

The SHLD instruction is used for multi-precision shifts of 64 bits or more. 

The instruction shifts the first operand (destination operand) to the left the number of bits specified by the third 
operand (count operand). The second operand (source operand) provides bits to shift in from the right (starting 
with bit 0 of the destination operand). 

The destination operand can be a register or a memory location; the source operand is a register. The count
operand is an unsigned integer that can be stored in an immediate byte or in the CL register. If the count operand
is CL, the shift count is the logical AND of CL and a count mask. In non-64-bit modes and default 64-bit mode, only
bits 0 through 4 of the count are used. This masks the count to a value between 0 and 31. If a count is greater than
the operand size, the result is undefined.

If the count is 1 or greater, the CF flag is filled with the last bit shifted out of the destination operand. For a 1-bit 
shift, the OF flag is set if a sign change occurred; otherwise, it is cleared. If the count operand is 0, flags are not 
affected. 

In 64-bit mode, the instruction's default operation size is 32 bits. Using a REX prefix in the form of REX.R permits 
access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits 
(upgrading the count mask to 6 bits). See the summary chart at the beginning of this section for encoding data and 
limits. 

Operation 

IF (In 64-Bit Mode and REX.W = 1)
    THEN COUNT ← COUNT MOD 64;
    ELSE COUNT ← COUNT MOD 32;
FI

SIZE ← OperandSize;

IF COUNT = 0
    THEN
        No operation;
    ELSE
        IF COUNT > SIZE
            THEN (* Bad parameters *)
                DEST is undefined;
                CF, OF, SF, ZF, AF, PF are undefined;
            ELSE (* Perform the shift *)
                CF ← BIT[DEST, SIZE - COUNT];
                (* Last bit shifted out on exit *)
                FOR i ← SIZE - 1 DOWN TO COUNT
                    DO
                        BIT[DEST, i] ← BIT[DEST, i - COUNT];
                    OD;
                FOR i ← COUNT - 1 DOWN TO 0
                    DO
                        BIT[DEST, i] ← BIT[SRC, i - COUNT + SIZE];
                    OD;
        FI;
FI;
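The bit-by-bit loops above are equivalent to a pair of whole-word shifts. The Python sketch below models that under stated assumptions: the function name `shld` and its return convention (result, CF) are hypothetical, CF is reported as `None` when the masked count is 0 because the flags are then unaffected, and the undefined case (count greater than the operand size) raises an exception instead of producing a value.

```python
def shld(dest, src, count, size=32):
    """Model a double precision shift left; returns (result, CF or None)."""
    count %= 64 if size == 64 else 32  # count mask: 6 bits for 64-bit operands, else 5
    if count == 0:
        return dest, None  # no operation, flags unaffected
    if count > size:
        raise ValueError("result and flags are undefined when count > operand size")
    mask = (1 << size) - 1
    cf = (dest >> (size - count)) & 1  # last bit shifted out of the destination
    result = ((dest << count) | (src >> (size - count))) & mask
    return result, cf


def shift_left_64_with_two_halves(hi, lo, count):
    """Multi-precision use: shift a 64-bit value held as two 32-bit halves (0 < count < 32)."""
    new_hi, _ = shld(hi, lo, count, 32)
    new_lo = (lo << count) & 0xFFFFFFFF
    return new_hi, new_lo
```

In the multi-precision helper, the high half is shifted first so that it can still pull its incoming bits from the unmodified low half; the low half is then shifted with an ordinary shift.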

Flags Affected 

If the count is 1 or greater, the CF flag is filled with the last bit shifted out of the destination operand and the SF, ZF,
and PF flags are set according to the value of the result. For a 1-bit shift, the OF flag is set if a sign change occurred;
otherwise, it is cleared. For shifts greater than 1 bit, the OF flag is undefined. If a shift occurs, the AF flag is
undefined. If the count operand is 0, the flags are not affected. If the count is greater than the operand size, the flags
are undefined.

Protected Mode Exceptions 

#GP(0) If the destination is located in a non-writable segment. 

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

If the DS, ES, FS, or GS register contains a NULL segment selector. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used. 

Real-Address Mode Exceptions 

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#SS If a memory operand effective address is outside the SS segment limit. 

#UD If the LOCK prefix is used. 

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made. 

#UD If the LOCK prefix is used. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode.

64-Bit Mode Exceptions

#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
#PF(fault-code)   If a page fault occurs.
#AC(0)            If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.
#UD               If the LOCK prefix is used.

SHRD—Double Precision Shift Right


Opcode* 

Instruction 

Op/ 

En 

64-Bit 

Mode 

Compat/ 
Leg Mode 

Description 

OF AC /r lb 

SHRD r/mT6, rl6, imm8 

MRI 

Valid 

Valid 

Shift r/m 76 to right /mmS places while 
shifting bits from r76 in from the left. 

OF AD /r 

SHRDr/m76,r76,CL 

MRC 

Valid 

Valid 

Shift r/m 76 to right CL places while shifting 
bits from r76 in from the left. 

OF AC /r lb 

SHRD r/m32, r32, imm8 

MRI 

Valid 

Valid 

Shift r/m32 to right /mmS places while 
shifting bits from r32 in from the left. 

REX.W + OF AC /r lb 

SHRD r/m64, r64, imm8 

MRI 

Valid 

N.E. 

Shift r/m64 to right imm8 places while 
shifting bits from r64 in from the left. 

OF AD /r 

SHRD r/m32, r32, CL 

MRC 

Valid 

Valid 

Shift r/m32 to right CL places while shifting 
bits from r32 in from the left. 

REX.W + OF AD /r 

SHRD rlm84, r64, CL 

MRC 

Valid 

N.E. 

Shift r/m64 to right CL places while shifting 
bits from r64 in from the left. 


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
MRI | ModRM:r/m (w) | ModRM:reg (r) | imm8 | NA
MRC | ModRM:r/m (w) | ModRM:reg (r) | CL | NA


Description 

The SHRD instruction is useful for multi-precision shifts of 64 bits or more. 

The instruction shifts the first operand (destination operand) to the right the number of bits specified by the third 
operand (count operand). The second operand (source operand) provides bits to shift in from the left (starting with 
the most significant bit of the destination operand). 

The destination operand can be a register or a memory location; the source operand is a register. The count 
operand is an unsigned integer that can be stored in an immediate byte or the CL register. If the count operand is 
CL, the shift count is the logical AND of CL and a count mask. In non-64-bit modes and default 64-bit mode, the 
width of the count mask is 5 bits. Only bits 0 through 4 of the count register are used (masking the count to a value 
between 0 and 31). If the count is greater than the operand size, the result is undefined. 

If the count is 1 or greater, the CF flag is filled with the last bit shifted out of the destination operand. For a 1-bit 
shift, the OF flag is set if a sign change occurred; otherwise, it is cleared. If the count operand is 0, flags are not 
affected. 

In 64-bit mode, the instruction's default operation size is 32 bits. Using a REX prefix in the form of REX.R permits 
access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits 
(upgrading the count mask to 6 bits). See the summary chart at the beginning of this section for encoding data and 
limits. 
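
The behavior described above can be sketched in C for the 32-bit operand size. This is a minimal model, not a definitive implementation: the helper name shrd32 and the convention of returning CF through a pointer are assumptions made for illustration, and only the CF flag is modeled.

```c
#include <stdint.h>

/* Hypothetical helper: models SHRD r/m32, r32, count.
   Returns the new destination value. *cf receives the last bit shifted
   out; it is left untouched when the masked count is 0, mirroring the
   "flags are not affected" rule. */
static uint32_t shrd32(uint32_t dest, uint32_t src, unsigned count, int *cf)
{
    count &= 31;                        /* 5-bit count mask (no REX.W) */
    if (count == 0)
        return dest;                    /* no operation */
    *cf = (int)((dest >> (count - 1)) & 1u);
    /* Low bits of src enter from the left, above the shifted destination. */
    return (dest >> count) | (src << (32 - count));
}
```

As a usage note, a 64-bit quantity held as two 32-bit halves can be shifted right by n (1 to 31) by calling shrd32(lo, hi, n, &cf) for the low half and then applying a plain right shift to the high half; this is the multi-precision pattern the instruction is intended for.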

Operation 

IF (In 64-Bit Mode and REX.W = 1)
    THEN COUNT <- COUNT MOD 64;
    ELSE COUNT <- COUNT MOD 32;
FI
SIZE <- OperandSize;
IF COUNT = 0
    THEN
        No operation;
    ELSE
        IF COUNT > SIZE
            THEN (* Bad parameters *)
                DEST is undefined;
                CF, OF, SF, ZF, AF, PF are undefined;
            ELSE (* Perform the shift *)
                CF <- BIT[DEST, COUNT - 1]; (* Last bit shifted out on exit *)
                FOR i <- 0 TO SIZE - 1 - COUNT
                    DO
                        BIT[DEST, i] <- BIT[DEST, i + COUNT];
                    OD;
                FOR i <- SIZE - COUNT TO SIZE - 1
                    DO
                        BIT[DEST, i] <- BIT[SRC, i + COUNT - SIZE];
                    OD;
        FI;
FI;

Flags Affected 

If the count is 1 or greater, the CF flag is filled with the last bit shifted out of the destination operand and the SF, 
ZF, and PF flags are set according to the value of the result. For a 1-bit shift, the OF flag is set if a sign change 
occurred; otherwise, it is cleared. For shifts greater than 1 bit, the OF flag is undefined. If a shift occurs, the AF flag 
is undefined. If the count operand is 0, the flags are not affected. If the count is greater than the operand size, the 
flags are undefined. 

Protected Mode Exceptions 

#GP(0) If the destination is located in a non-writable segment. 

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

If the DS, ES, FS, or GS register contains a NULL segment selector. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used. 

Real-Address Mode Exceptions 

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If the LOCK prefix is used.

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made.

#UD If the LOCK prefix is used.

Compatibility Mode Exceptions

Same exceptions as in protected mode.

64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the memory address is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used.

SHUFPD—Packed Interleave Shuffle of Pairs of Double-Precision Floating-Point Values

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F C6 /r ib | SHUFPD xmm1, xmm2/m128, imm8 | RMI | V/V | SSE2 | Shuffle two pairs of double-precision floating-point values from xmm1 and xmm2/m128 using imm8 to select from each pair, interleaved result is stored in xmm1.
VEX.NDS.128.66.0F.WIG C6 /r ib | VSHUFPD xmm1, xmm2, xmm3/m128, imm8 | RVMI | V/V | AVX | Shuffle two pairs of double-precision floating-point values from xmm2 and xmm3/m128 using imm8 to select from each pair, interleaved result is stored in xmm1.
VEX.NDS.256.66.0F.WIG C6 /r ib | VSHUFPD ymm1, ymm2, ymm3/m256, imm8 | RVMI | V/V | AVX | Shuffle four pairs of double-precision floating-point values from ymm2 and ymm3/m256 using imm8 to select from each pair, interleaved result is stored in ymm1.
EVEX.NDS.128.66.0F.W1 C6 /r ib | VSHUFPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst, imm8 | FV | V/V | AVX512VL AVX512F | Shuffle two pairs of double-precision floating-point values from xmm2 and xmm3/m128/m64bcst using imm8 to select from each pair, store interleaved results in xmm1 subject to writemask k1.
EVEX.NDS.256.66.0F.W1 C6 /r ib | VSHUFPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst, imm8 | FV | V/V | AVX512VL AVX512F | Shuffle four pairs of double-precision floating-point values from ymm2 and ymm3/m256/m64bcst using imm8 to select from each pair, store interleaved results in ymm1 subject to writemask k1.
EVEX.NDS.512.66.0F.W1 C6 /r ib | VSHUFPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst, imm8 | FV | V/V | AVX512F | Shuffle eight pairs of double-precision floating-point values from zmm2 and zmm3/m512/m64bcst using imm8 to select from each pair, store interleaved results in zmm1 subject to writemask k1.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RMI | ModRM:reg (r, w) | ModRM:r/m (r) | Imm8 | NA
RVMI | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | Imm8
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | Imm8


Description 

Selects a double-precision floating-point value of an input pair using a bit control and moves it to a designated
element of the destination operand. The low-to-high order of double-precision elements of the destination operand
is interleaved between the first source operand and the second source operand at the granularity of an input pair of
128 bits. Each bit in the imm8 byte, starting from bit 0, is the select control of the corresponding element of the
destination to receive the shuffled result of an input pair.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
64-bit memory location. The destination operand is a ZMM/YMM/XMM register updated according to the writemask.
The select controls are the lower 8/4/2 bits of the imm8 byte.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. The select controls are bits 3:0
of the imm8 byte; imm8[7:4] are ignored.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128)
of the corresponding ZMM register destination are zeroed. The select controls are bits 1:0 of the imm8 byte;
imm8[7:2] are ignored.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination
operand and the first source operand are the same XMM register. The upper bits (MAX_VL-1:128) of
the corresponding ZMM register destination are unmodified. The select controls are bits 1:0 of the imm8 byte;
imm8[7:2] are ignored.



Figure 4-25. 256-bit VSHUFPD Operation of Four Pairs of DP FP Values 
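
The selection rule described above can be illustrated with a scalar C model. This is a sketch under stated assumptions rather than a description of any hardware implementation: the names shufpd_model and apply_writemask_pd are hypothetical, vectors are represented as plain arrays of n doubles (n = 2, 4 or 8), and dst must not alias either source.

```c
#include <stdint.h>

/* Hypothetical scalar model of SHUFPD/VSHUFPD over n doubles.
   Even destination elements come from a (first source), odd ones from b
   (second source); bit j of imm8 picks the low (0) or high (1) double
   of the 128-bit pair that contains element j. */
static void shufpd_model(double *dst, const double *a, const double *b,
                         int n, uint8_t imm8)
{
    for (int j = 0; j < n; j++) {
        int pair = j & ~1;               /* index of the pair's low element */
        int sel  = (imm8 >> j) & 1;      /* select control for element j */
        const double *src = (j & 1) ? b : a;
        dst[j] = src[pair + sel];
    }
}

/* Hypothetical model of the EVEX writemask step: tmp holds the shuffled
   result, k is the mask, zeroing selects zeroing-masking over merging. */
static void apply_writemask_pd(double *dst, const double *tmp, int n,
                               uint8_t k, int zeroing)
{
    for (int j = 0; j < n; j++) {
        if ((k >> j) & 1)
            dst[j] = tmp[j];
        else if (zeroing)
            dst[j] = 0.0;
        /* merging-masking: dst[j] stays unchanged */
    }
}
```

For instance, with a = {1, 2, 3, 4}, b = {5, 6, 7, 8} and imm8 = 0x6, the model selects a[0], b[1], a[3] and b[2] in that order.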


Operation 

VSHUFPD (EVEX encoded versions when SRC2 is a vector register)

(KL, VL) = (2, 128), (4, 256), (8, 512)
IF IMM8[0] = 0
    THEN TMP_DEST[63:0] <- SRC1[63:0]
    ELSE TMP_DEST[63:0] <- SRC1[127:64] FI;
IF IMM8[1] = 0
    THEN TMP_DEST[127:64] <- SRC2[63:0]
    ELSE TMP_DEST[127:64] <- SRC2[127:64] FI;
IF VL >= 256
    IF IMM8[2] = 0
        THEN TMP_DEST[191:128] <- SRC1[191:128]
        ELSE TMP_DEST[191:128] <- SRC1[255:192] FI;
    IF IMM8[3] = 0
        THEN TMP_DEST[255:192] <- SRC2[191:128]
        ELSE TMP_DEST[255:192] <- SRC2[255:192] FI;
FI;
IF VL >= 512
    IF IMM8[4] = 0
        THEN TMP_DEST[319:256] <- SRC1[319:256]
        ELSE TMP_DEST[319:256] <- SRC1[383:320] FI;
    IF IMM8[5] = 0
        THEN TMP_DEST[383:320] <- SRC2[319:256]
        ELSE TMP_DEST[383:320] <- SRC2[383:320] FI;
    IF IMM8[6] = 0
        THEN TMP_DEST[447:384] <- SRC1[447:384]
        ELSE TMP_DEST[447:384] <- SRC1[511:448] FI;
    IF IMM8[7] = 0
        THEN TMP_DEST[511:448] <- SRC2[447:384]
        ELSE TMP_DEST[511:448] <- SRC2[511:448] FI;
FI;
FOR j <- 0 TO KL-1
    i <- j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] <- TMP_DEST[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+63:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0


VSHUFPD (EVEX encoded versions when SRC2 is memory)

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j <- 0 TO KL-1
    i <- j * 64
    IF (EVEX.b = 1)
        THEN TMP_SRC2[i+63:i] <- SRC2[63:0]
        ELSE TMP_SRC2[i+63:i] <- SRC2[i+63:i]
    FI;
ENDFOR;
IF IMM8[0] = 0
    THEN TMP_DEST[63:0] <- SRC1[63:0]
    ELSE TMP_DEST[63:0] <- SRC1[127:64] FI;
IF IMM8[1] = 0
    THEN TMP_DEST[127:64] <- TMP_SRC2[63:0]
    ELSE TMP_DEST[127:64] <- TMP_SRC2[127:64] FI;
IF VL >= 256
    IF IMM8[2] = 0
        THEN TMP_DEST[191:128] <- SRC1[191:128]
        ELSE TMP_DEST[191:128] <- SRC1[255:192] FI;
    IF IMM8[3] = 0
        THEN TMP_DEST[255:192] <- TMP_SRC2[191:128]
        ELSE TMP_DEST[255:192] <- TMP_SRC2[255:192] FI;
FI;
IF VL >= 512
    IF IMM8[4] = 0
        THEN TMP_DEST[319:256] <- SRC1[319:256]
        ELSE TMP_DEST[319:256] <- SRC1[383:320] FI;
    IF IMM8[5] = 0
        THEN TMP_DEST[383:320] <- TMP_SRC2[319:256]
        ELSE TMP_DEST[383:320] <- TMP_SRC2[383:320] FI;
    IF IMM8[6] = 0
        THEN TMP_DEST[447:384] <- SRC1[447:384]
        ELSE TMP_DEST[447:384] <- SRC1[511:448] FI;
    IF IMM8[7] = 0
        THEN TMP_DEST[511:448] <- TMP_SRC2[447:384]
        ELSE TMP_DEST[511:448] <- TMP_SRC2[511:448] FI;
FI;
FOR j <- 0 TO KL-1
    i <- j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] <- TMP_DEST[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+63:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0


VSHUFPD (VEX.256 encoded version)

IF IMM8[0] = 0
    THEN DEST[63:0] <- SRC1[63:0]
    ELSE DEST[63:0] <- SRC1[127:64] FI;
IF IMM8[1] = 0
    THEN DEST[127:64] <- SRC2[63:0]
    ELSE DEST[127:64] <- SRC2[127:64] FI;
IF IMM8[2] = 0
    THEN DEST[191:128] <- SRC1[191:128]
    ELSE DEST[191:128] <- SRC1[255:192] FI;
IF IMM8[3] = 0
    THEN DEST[255:192] <- SRC2[191:128]
    ELSE DEST[255:192] <- SRC2[255:192] FI;
DEST[MAX_VL-1:256] (Unmodified)


VSHUFPD (VEX.128 encoded version)

IF IMM8[0] = 0
    THEN DEST[63:0] <- SRC1[63:0]
    ELSE DEST[63:0] <- SRC1[127:64] FI;
IF IMM8[1] = 0
    THEN DEST[127:64] <- SRC2[63:0]
    ELSE DEST[127:64] <- SRC2[127:64] FI;
DEST[MAX_VL-1:128] <- 0


SHUFPD (128-bit Legacy SSE version)

IF IMM8[0] = 0
    THEN DEST[63:0] <- SRC1[63:0]
    ELSE DEST[63:0] <- SRC1[127:64] FI;
IF IMM8[1] = 0
    THEN DEST[127:64] <- SRC2[63:0]
    ELSE DEST[127:64] <- SRC2[127:64] FI;
DEST[MAX_VL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent 

VSHUFPD __m512d _mm512_shuffle_pd(__m512d a, __m512d b, int imm);

VSHUFPD __m512d _mm512_mask_shuffle_pd(__m512d s, __mmask8 k, __m512d a, __m512d b, int imm);

VSHUFPD __m512d _mm512_maskz_shuffle_pd(__mmask8 k, __m512d a, __m512d b, int imm);

VSHUFPD __m256d _mm256_shuffle_pd(__m256d a, __m256d b, const int select);

VSHUFPD __m256d _mm256_mask_shuffle_pd(__m256d s, __mmask8 k, __m256d a, __m256d b, int imm);

VSHUFPD __m256d _mm256_maskz_shuffle_pd(__mmask8 k, __m256d a, __m256d b, int imm);

SHUFPD __m128d _mm_shuffle_pd(__m128d a, __m128d b, const int select);

VSHUFPD __m128d _mm_mask_shuffle_pd(__m128d s, __mmask8 k, __m128d a, __m128d b, int imm);

VSHUFPD __m128d _mm_maskz_shuffle_pd(__mmask8 k, __m128d a, __m128d b, int imm);




SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 
EVEX-encoded instruction, see Exceptions Type E4NF. 

SHUFPS—Packed Interleave Shuffle of Quadruplets of Single-Precision Floating-Point Values

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F C6 /r ib | SHUFPS xmm1, xmm3/m128, imm8 | RMI | V/V | SSE | Select from quadruplet of single-precision floating-point values in xmm1 and xmm3/m128 using imm8, interleaved result pairs are stored in xmm1.
VEX.NDS.128.0F.WIG C6 /r ib | VSHUFPS xmm1, xmm2, xmm3/m128, imm8 | RVMI | V/V | AVX | Select from quadruplet of single-precision floating-point values in xmm2 and xmm3/m128 using imm8, interleaved result pairs are stored in xmm1.
VEX.NDS.256.0F.WIG C6 /r ib | VSHUFPS ymm1, ymm2, ymm3/m256, imm8 | RVMI | V/V | AVX | Select from quadruplet of single-precision floating-point values in ymm2 and ymm3/m256 using imm8, interleaved result pairs are stored in ymm1.
EVEX.NDS.128.0F.W0 C6 /r ib | VSHUFPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst, imm8 | FV | V/V | AVX512VL AVX512F | Select from quadruplet of single-precision floating-point values in xmm2 and xmm3/m128 using imm8, interleaved result pairs are stored in xmm1, subject to writemask k1.
EVEX.NDS.256.0F.W0 C6 /r ib | VSHUFPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst, imm8 | FV | V/V | AVX512VL AVX512F | Select from quadruplet of single-precision floating-point values in ymm2 and ymm3/m256 using imm8, interleaved result pairs are stored in ymm1, subject to writemask k1.
EVEX.NDS.512.0F.W0 C6 /r ib | VSHUFPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst, imm8 | FV | V/V | AVX512F | Select from quadruplet of single-precision floating-point values in zmm2 and zmm3/m512 using imm8, interleaved result pairs are stored in zmm1, subject to writemask k1.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RMI | ModRM:reg (r, w) | ModRM:r/m (r) | Imm8 | NA
RVMI | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | Imm8
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | Imm8


Description 

Selects a single-precision floating-point value of an input quadruplet using a two-bit control and moves it to a
designated element of the destination operand. Each 64-bit element-pair of a 128-bit lane of the destination operand is
interleaved between the corresponding lane of the first source operand and the second source operand at the
granularity of 128 bits. Each two bits in the imm8 byte, starting from bit 0, form the select control of the corresponding
element of a 128-bit lane of the destination to receive the shuffled result of an input quadruplet. The two lower
elements of a 128-bit lane in the destination receive shuffle results from the quadruplet of the first source operand.
The next two elements of the destination receive shuffle results from the quadruplet of the second source operand.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
32-bit memory location. The destination operand is a ZMM/YMM/XMM register updated according to the writemask.
Imm8[7:0] provides 4 select controls for each applicable 128-bit lane of the destination.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. Imm8[7:0] provides 4 select
controls for the high and low 128-bit lanes of the destination.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128)
of the corresponding ZMM register destination are zeroed. Imm8[7:0] provides 4 select controls for each element
of the destination.

128-bit Legacy SSE version: The source can be an XMM register or a 128-bit memory location. The destination is
not distinct from the first source XMM register and the upper bits (MAX_VL-1:128) of the corresponding ZMM
register destination are unmodified. Imm8[7:0] provides 4 select controls for each element of the destination.

Figure 4-26. 256-bit VSHUFPS Operation of Selection from Input Quadruplet and Pair-wise Interleaved Result
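
The per-lane selection described above can be sketched as a scalar C model. The function name shufps_model and the array representation (n floats, with n = 4, 8 or 16 and dst not aliasing either source) are assumptions made for illustration; write-masking is omitted.

```c
#include <stdint.h>

/* Hypothetical scalar model of SHUFPS/VSHUFPS over n floats.
   The same imm8 is applied to every 128-bit lane (4 floats): the two low
   results of a lane are picked from a, the two high results from b,
   each by one 2-bit field of imm8. */
static void shufps_model(float *dst, const float *a, const float *b,
                         int n, uint8_t imm8)
{
    for (int lane = 0; lane < n; lane += 4) {
        dst[lane + 0] = a[lane + ( imm8       & 3)];   /* imm8[1:0] */
        dst[lane + 1] = a[lane + ((imm8 >> 2) & 3)];   /* imm8[3:2] */
        dst[lane + 2] = b[lane + ((imm8 >> 4) & 3)];   /* imm8[5:4] */
        dst[lane + 3] = b[lane + ((imm8 >> 6) & 3)];   /* imm8[7:6] */
    }
}
```

With imm8 = 0x1B (fields 3, 2, 1, 0 from low to high), a = {0, 1, 2, 3} and b = {10, 11, 12, 13}, the model picks a[3], a[2], b[1] and b[0].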


Operation 

Select4(SRC, control) {
    CASE (control[1:0]) OF
        0: TMP <- SRC[31:0];
        1: TMP <- SRC[63:32];
        2: TMP <- SRC[95:64];
        3: TMP <- SRC[127:96];
    ESAC;
    RETURN TMP
}

VSHUFPS (EVEX encoded versions when SRC2 is a vector register)

(KL, VL) = (4, 128), (8, 256), (16, 512)
TMP_DEST[31:0] <- Select4(SRC1[127:0], imm8[1:0]);
TMP_DEST[63:32] <- Select4(SRC1[127:0], imm8[3:2]);
TMP_DEST[95:64] <- Select4(SRC2[127:0], imm8[5:4]);
TMP_DEST[127:96] <- Select4(SRC2[127:0], imm8[7:6]);
IF VL >= 256
    TMP_DEST[159:128] <- Select4(SRC1[255:128], imm8[1:0]);
    TMP_DEST[191:160] <- Select4(SRC1[255:128], imm8[3:2]);
    TMP_DEST[223:192] <- Select4(SRC2[255:128], imm8[5:4]);
    TMP_DEST[255:224] <- Select4(SRC2[255:128], imm8[7:6]);
FI;
IF VL >= 512
    TMP_DEST[287:256] <- Select4(SRC1[383:256], imm8[1:0]);
    TMP_DEST[319:288] <- Select4(SRC1[383:256], imm8[3:2]);
    TMP_DEST[351:320] <- Select4(SRC2[383:256], imm8[5:4]);
    TMP_DEST[383:352] <- Select4(SRC2[383:256], imm8[7:6]);
    TMP_DEST[415:384] <- Select4(SRC1[511:384], imm8[1:0]);
    TMP_DEST[447:416] <- Select4(SRC1[511:384], imm8[3:2]);
    TMP_DEST[479:448] <- Select4(SRC2[511:384], imm8[5:4]);
    TMP_DEST[511:480] <- Select4(SRC2[511:384], imm8[7:6]);
FI;
FOR j <- 0 TO KL-1
    i <- j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] <- TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+31:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0


VSHUFPS (EVEX encoded versions when SRC2 is memory)

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j <- 0 TO KL-1
    i <- j * 32
    IF (EVEX.b = 1)
        THEN TMP_SRC2[i+31:i] <- SRC2[31:0]
        ELSE TMP_SRC2[i+31:i] <- SRC2[i+31:i]
    FI;
ENDFOR;
TMP_DEST[31:0] <- Select4(SRC1[127:0], imm8[1:0]);
TMP_DEST[63:32] <- Select4(SRC1[127:0], imm8[3:2]);
TMP_DEST[95:64] <- Select4(TMP_SRC2[127:0], imm8[5:4]);
TMP_DEST[127:96] <- Select4(TMP_SRC2[127:0], imm8[7:6]);
IF VL >= 256
    TMP_DEST[159:128] <- Select4(SRC1[255:128], imm8[1:0]);
    TMP_DEST[191:160] <- Select4(SRC1[255:128], imm8[3:2]);
    TMP_DEST[223:192] <- Select4(TMP_SRC2[255:128], imm8[5:4]);
    TMP_DEST[255:224] <- Select4(TMP_SRC2[255:128], imm8[7:6]);
FI;
IF VL >= 512
    TMP_DEST[287:256] <- Select4(SRC1[383:256], imm8[1:0]);
    TMP_DEST[319:288] <- Select4(SRC1[383:256], imm8[3:2]);
    TMP_DEST[351:320] <- Select4(TMP_SRC2[383:256], imm8[5:4]);
    TMP_DEST[383:352] <- Select4(TMP_SRC2[383:256], imm8[7:6]);
    TMP_DEST[415:384] <- Select4(SRC1[511:384], imm8[1:0]);
    TMP_DEST[447:416] <- Select4(SRC1[511:384], imm8[3:2]);
    TMP_DEST[479:448] <- Select4(TMP_SRC2[511:384], imm8[5:4]);
    TMP_DEST[511:480] <- Select4(TMP_SRC2[511:384], imm8[7:6]);
FI;
FOR j <- 0 TO KL-1
    i <- j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] <- TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking* ; zeroing-masking
                    DEST[i+31:i] <- 0
            FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] <- 0

VSHUFPS (VEX.256 encoded version)

DEST[31:0] <- Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] <- Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] <- Select4(SRC2[127:0], imm8[5:4]);
DEST[127:96] <- Select4(SRC2[127:0], imm8[7:6]);
DEST[159:128] <- Select4(SRC1[255:128], imm8[1:0]);
DEST[191:160] <- Select4(SRC1[255:128], imm8[3:2]);
DEST[223:192] <- Select4(SRC2[255:128], imm8[5:4]);
DEST[255:224] <- Select4(SRC2[255:128], imm8[7:6]);
DEST[MAX_VL-1:256] <- 0

VSHUFPS (VEX.128 encoded version)

DEST[31:0] <- Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] <- Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] <- Select4(SRC2[127:0], imm8[5:4]);
DEST[127:96] <- Select4(SRC2[127:0], imm8[7:6]);
DEST[MAX_VL-1:128] <- 0

SHUFPS (128-bit Legacy SSE version)

DEST[31:0] <- Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] <- Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] <- Select4(SRC2[127:0], imm8[5:4]);
DEST[127:96] <- Select4(SRC2[127:0], imm8[7:6]);
DEST[MAX_VL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent 

VSHUFPS __m512 _mm512_shuffle_ps(__m512 a, __m512 b, int imm);

VSHUFPS __m512 _mm512_mask_shuffle_ps(__m512 s, __mmask16 k, __m512 a, __m512 b, int imm);

VSHUFPS __m512 _mm512_maskz_shuffle_ps(__mmask16 k, __m512 a, __m512 b, int imm);

VSHUFPS __m256 _mm256_shuffle_ps(__m256 a, __m256 b, const int select);

VSHUFPS __m256 _mm256_mask_shuffle_ps(__m256 s, __mmask8 k, __m256 a, __m256 b, int imm);

VSHUFPS __m256 _mm256_maskz_shuffle_ps(__mmask8 k, __m256 a, __m256 b, int imm);

SHUFPS __m128 _mm_shuffle_ps(__m128 a, __m128 b, const int select);

VSHUFPS __m128 _mm_mask_shuffle_ps(__m128 s, __mmask8 k, __m128 a, __m128 b, int imm);

VSHUFPS __m128 _mm_maskz_shuffle_ps(__mmask8 k, __m128 a, __m128 b, int imm);

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 4. 

EVEX-encoded instruction, see Exceptions Type E4NF. 

SIDT—Store Interrupt Descriptor Table Register

Opcode* | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
0F 01 /1 | SIDT m | M | Valid | Valid | Store IDTR to m.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
M | ModRM:r/m (w) | NA | NA | NA


Description 

Stores the content of the interrupt descriptor table register (IDTR) in the destination operand. The destination
operand specifies a 6-byte memory location.

In non-64-bit modes, if the operand-size attribute is 32 bits, the 16-bit limit field of the register is stored in the low
2 bytes of the memory location and the 32-bit base address is stored in the high 4 bytes. If the operand-size
attribute is 16 bits, the limit is stored in the low 2 bytes and the 24-bit base address is stored in the third, fourth, and
fifth byte, with the sixth byte filled with 0s.

In 64-bit mode, the operand size is fixed at 8+2 bytes. The instruction stores 8-byte base and 2-byte limit values.
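
The memory layouts described above can be read back with a small C helper. This is a sketch for illustration only: the type idtr_image and the function decode_sidt are hypothetical names, the buffer is assumed to hold the bytes exactly as SIDT stored them (little-endian, limit first), and opsize is assumed to be 16, 32 or 64.

```c
#include <stdint.h>

/* Hypothetical in-memory view of what SIDT stored. */
typedef struct {
    uint16_t limit;   /* bytes 0..1 */
    uint64_t base;    /* 24, 32 or 64 bits starting at byte 2 */
} idtr_image;

/* Decode a buffer written by SIDT. opsize selects how many base bytes
   are meaningful: 3 for 16-bit, 4 for 32-bit, 8 for 64-bit mode. */
static idtr_image decode_sidt(const uint8_t *buf, int opsize)
{
    idtr_image r;
    int base_bytes = (opsize == 64) ? 8 : (opsize == 32) ? 4 : 3;

    r.limit = (uint16_t)(buf[0] | (buf[1] << 8));
    r.base = 0;
    for (int i = 0; i < base_bytes; i++)
        r.base |= (uint64_t)buf[2 + i] << (8 * i);
    return r;
}
```

The caller needs a 6-byte buffer outside 64-bit mode and a 10-byte buffer in 64-bit mode; the helper never reads past those sizes.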

SIDT is only useful in operating-system software; however, it can be used in application programs without causing 
an exception to be generated if CR4.UMIP = 0. See "LGDT/LIDT—Load Global/Interrupt Descriptor Table Register" 
in Chapter 3, Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2A, for information on 
loading the GDTR and IDTR. 

IA-32 Architecture Compatibility 

The 16-bit form of SIDT is compatible with the Intel 286 processor if the upper 8 bits are not referenced. The Intel
286 processor fills these bits with 1s; processor generations later than the Intel 286 processor fill these bits with
0s.

Operation 

IF instruction is SIDT
    THEN
        IF OperandSize = 16
            THEN
                DEST[0:15] <- IDTR(Limit);
                DEST[16:39] <- IDTR(Base); (* 24 bits of base address stored *)
                DEST[40:47] <- 0;
            ELSE IF (32-bit Operand Size)
                DEST[0:15] <- IDTR(Limit);
                DEST[16:47] <- IDTR(Base); (* Full 32-bit base address stored *)
            ELSE (* 64-bit Operand Size *)
                DEST[0:15] <- IDTR(Limit);
                DEST[16:79] <- IDTR(Base); (* Full 64-bit base address stored *)
        FI;
FI;

Flags Affected 

None. 

Protected Mode Exceptions

#GP(0) If the destination is located in a non-writable segment.

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register is used to access memory and it contains a NULL segment selector.

If CR4.UMIP = 1 and CPL > 0.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while CPL = 3.

#UD If the LOCK prefix is used.

Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If the LOCK prefix is used.

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If CR4.UMIP = 1.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made.

#UD If the LOCK prefix is used.


Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions 

#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 

#UD If the destination operand is a register. 

If the LOCK prefix is used. 

#GP(0) If the memory address is in a non-canonical form. 

If CR4.UMIP = 1 and CPL > 0. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while CPL = 3. 

SLDT—Store Local Descriptor Table Register

Opcode* | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
0F 00 /0 | SLDT r/m16 | M | Valid | Valid | Stores segment selector from LDTR in r/m16.
REX.W + 0F 00 /0 | SLDT r64/m16 | M | Valid | Valid | Stores segment selector from LDTR in r64/m16.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
M | ModRM:r/m (w) | NA | NA | NA


Description 

Stores the segment selector from the local descriptor table register (LDTR) in the destination operand. The
destination operand can be a general-purpose register or a memory location. The segment selector stored with this
instruction points to the segment descriptor (located in the GDT) for the current LDT. This instruction can only be
executed in protected mode.

Outside IA-32e mode, when the destination operand is a 32-bit register, the 16-bit segment selector is copied into
the low-order 16 bits of the register. The high-order 16 bits of the register are cleared for the Pentium 4, Intel Xeon,
and P6 family processors. They are undefined for Pentium, Intel486, and Intel386 processors. When the
destination operand is a memory location, the segment selector is written to memory as a 16-bit quantity, regardless of
the operand size.

In compatibility mode, when the destination operand is a 32-bit register, the 16-bit segment selector is copied into
the low-order 16 bits of the register. The high-order 16 bits of the register are cleared. When the destination
operand is a memory location, the segment selector is written to memory as a 16-bit quantity, regardless of the
operand size.

In 64-bit mode, using a REX prefix in the form of REX.R permits access to additional registers (R8-R15). The
behavior of SLDT with a 64-bit register is to zero-extend the 16-bit selector and store it in the register. If the
destination is memory and operand size is 64, SLDT will write the 16-bit selector to memory as a 16-bit quantity,
regardless of the operand size.

Operation 

DEST <- LDTR(SegmentSelector);

Flags Affected 

None. 


Protected Mode Exceptions

#GP(0) If the destination is located in a non-writable segment.

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register is used to access memory and it contains a NULL segment selector.

If CR4.UMIP = 1 and CPL > 0.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while CPL = 3.

#UD If the LOCK prefix is used.

Real-Address Mode Exceptions

#UD The SLDT instruction is not recognized in real-address mode.

Virtual-8086 Mode Exceptions

#UD The SLDT instruction is not recognized in virtual-8086 mode.

Compatibility Mode Exceptions

Same exceptions as in protected mode.

64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the memory address is in a non-canonical form.

If CR4.UMIP = 1 and CPL > 0.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while CPL = 3.

#UD If the LOCK prefix is used.

SMSW—Store Machine Status Word

Opcode* | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
0F 01 /4 | SMSW r/m16 | M | Valid | Valid | Store machine status word to r/m16.
0F 01 /4 | SMSW r32/m16 | M | Valid | Valid | Store machine status word in low-order 16 bits of r32/m16; high-order 16 bits of r32 are undefined.
REX.W + 0F 01 /4 | SMSW r64/m16 | M | Valid | Valid | Store machine status word in low-order 16 bits of r64/m16; high-order 16 bits of r32 are undefined.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
M | ModRM:r/m (w) | NA | NA | NA


Description 

Stores the machine status word (bits 0 through 15 of control register CR0) into the destination operand. The
destination operand can be a general-purpose register or a memory location.

In non-64-bit modes, when the destination operand is a 32-bit register, the low-order 16 bits of register CR0 are
copied into the low-order 16 bits of the register and the high-order 16 bits are undefined. When the destination
operand is a memory location, the low-order 16 bits of register CR0 are written to memory as a 16-bit quantity,
regardless of the operand size.

In 64-bit mode, the behavior of the SMSW instruction is defined by the following examples: 

• SMSW r16 operand size 16, store CR0[15:0] in r16

• SMSW r32 operand size 32, zero-extend CR0[31:0], and store in r32

• SMSW r64 operand size 64, zero-extend CR0[63:0], and store in r64

• SMSW m16 operand size 16, store CR0[15:0] in m16

• SMSW m16 operand size 32, store CR0[15:0] in m16 (not m32)

• SMSW m16 operand size 64, store CR0[15:0] in m16 (not m64)
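
The 64-bit-mode cases listed above can be summarized in a small model. This is only an illustrative sketch: the function name `smsw` and its parameters are hypothetical, CR0 is passed in as a plain integer, and nothing here executes the real instruction.

```python
from typing import Tuple


def smsw(cr0: int, dest: str, operand_size: int) -> Tuple[int, int]:
    """Return (value, width_in_bits) that SMSW would write in 64-bit mode.

    dest is "reg" or "mem"; operand_size is 16, 32, or 64.
    Hypothetical helper that mirrors the bullet list, nothing more.
    """
    if operand_size not in (16, 32, 64):
        raise ValueError("operand size must be 16, 32, or 64")
    if dest == "mem":
        # Memory destinations always receive CR0[15:0] as a 16-bit quantity.
        return cr0 & 0xFFFF, 16
    if dest == "reg":
        if operand_size == 16:
            return cr0 & 0xFFFF, 16
        if operand_size == 32:
            # CR0[31:0] is stored in r32.
            return cr0 & 0xFFFFFFFF, 32
        # CR0[63:0] is stored in r64.
        return cr0 & 0xFFFFFFFFFFFFFFFF, 64
    raise ValueError("dest must be 'reg' or 'mem'")
```

For example, with a CR0 value of 80050033H the memory forms always yield 0033H, while the r32 form yields the full low doubleword.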

SMSW is only useful in operating-system software. However, it is not a privileged instruction and can be used in
application programs if CR4.UMIP = 0. It is provided for compatibility with the Intel 286 processor. Programs and
procedures intended to run on IA-32 and Intel 64 processors beginning with the Intel386 processors should use the
MOV CR instruction to load the machine status word.

See "Changes to Instruction Behavior in VMX Non-Root Operation" in Chapter 25 of the Intel® 64 and IA-32
Architectures Software Developer's Manual, Volume 3C, for more information about the behavior of this instruction in
VMX non-root operation.

Operation 

DEST ← CR0[15:0]; (* Machine status word *)

Flags Affected 

None. 
Protected Mode Exceptions

#GP(0)            If the destination is located in a non-writable segment.
                  If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
                  If the DS, ES, FS, or GS register is used to access memory and it contains a NULL segment selector.
                  If CR4.UMIP = 1 and CPL > 0.
#SS(0)            If a memory operand effective address is outside the SS segment limit.
#PF(fault-code)   If a page fault occurs.
#AC(0)            If alignment checking is enabled and an unaligned memory reference is made while CPL = 3.
#UD               If the LOCK prefix is used.


Real-Address Mode Exceptions 


#GP               If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
#SS(0)            If a memory operand effective address is outside the SS segment limit.
#UD               If the LOCK prefix is used.


Virtual-8086 Mode Exceptions

#GP(0)            If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.
                  If CR4.UMIP = 1.
#SS(0)            If a memory operand effective address is outside the SS segment limit.
#PF(fault-code)   If a page fault occurs.
#AC(0)            If alignment checking is enabled and an unaligned memory reference is made.
#UD               If the LOCK prefix is used.


Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions 

#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If CR4.UMIP = 1 and CPL > 0.
#PF(fault-code)   If a page fault occurs.
#AC(0)            If alignment checking is enabled and an unaligned memory reference is made while CPL = 3.
#UD               If the LOCK prefix is used.
SQRTPD—Square Root of Double-Precision Floating-Point Values

Opcode/Instruction                            Op/En  64/32-bit Mode Support  CPUID Feature Flag  Description
66 0F 51 /r                                   RM     V/V                     SSE2                Computes square roots of the packed double-precision floating-point values in xmm2/m128 and stores the result in xmm1.
SQRTPD xmm1, xmm2/m128
VEX.128.66.0F.WIG 51 /r                       RM     V/V                     AVX                 Computes square roots of the packed double-precision floating-point values in xmm2/m128 and stores the result in xmm1.
VSQRTPD xmm1, xmm2/m128
VEX.256.66.0F.WIG 51 /r                       RM     V/V                     AVX                 Computes square roots of the packed double-precision floating-point values in ymm2/m256 and stores the result in ymm1.
VSQRTPD ymm1, ymm2/m256
EVEX.128.66.0F.W1 51 /r                       FV     V/V                     AVX512VL AVX512F    Computes square roots of the packed double-precision floating-point values in xmm2/m128/m64bcst and stores the result in xmm1 subject to writemask k1.
VSQRTPD xmm1 {k1}{z}, xmm2/m128/m64bcst
EVEX.256.66.0F.W1 51 /r                       FV     V/V                     AVX512VL AVX512F    Computes square roots of the packed double-precision floating-point values in ymm2/m256/m64bcst and stores the result in ymm1 subject to writemask k1.
VSQRTPD ymm1 {k1}{z}, ymm2/m256/m64bcst
EVEX.512.66.0F.W1 51 /r                       FV     V/V                     AVX512F             Computes square roots of the packed double-precision floating-point values in zmm2/m512/m64bcst and stores the result in zmm1 subject to writemask k1.
VSQRTPD zmm1 {k1}{z}, zmm2/m512/m64bcst{er}




Instruction Operand Encoding

Op/En  Operand 1      Operand 2      Operand 3  Operand 4
RM     ModRM:reg (w)  ModRM:r/m (r)  NA         NA
FV     ModRM:reg (w)  ModRM:r/m (r)  NA         NA


Description 

Performs a SIMD computation of the square roots of the two, four, or eight packed double-precision floating-point
values in the source operand (the second operand) and stores the packed double-precision floating-point results in
the destination operand (the first operand).

EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or
a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a
ZMM/YMM/XMM register updated according to the writemask.

VEX.256 encoded version: The source operand is a YMM register or a 256-bit memory location. The destination
operand is a YMM register. The upper bits (MAX_VL-1:256) of the corresponding ZMM register destination are
zeroed.

VEX.128 encoded version: The source operand is an XMM register or a 128-bit memory location. The destination
operand is an XMM register. The upper bits (MAX_VL-1:128) of the corresponding ZMM register destination are
zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or 128-bit memory location. The
destination is not distinct from the first source XMM register and the upper bits (MAX_VL-1:128) of the
corresponding ZMM register destination are unmodified.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD. 
Operation

VSQRTPD (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1) AND (SRC *is register*)
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC *is memory*)
            THEN DEST[i+63:i] ← SQRT(SRC[63:0])
            ELSE DEST[i+63:i] ← SQRT(SRC[i+63:i])
        FI;
    ELSE
        IF *merging-masking*                ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE                            ; zeroing-masking
                DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VSQRTPD (VEX.256 encoded version)

DEST[63:0] ← SQRT(SRC[63:0])
DEST[127:64] ← SQRT(SRC[127:64])
DEST[191:128] ← SQRT(SRC[191:128])
DEST[255:192] ← SQRT(SRC[255:192])
DEST[MAX_VL-1:256] ← 0

VSQRTPD (VEX.128 encoded version)

DEST[63:0] ← SQRT(SRC[63:0])
DEST[127:64] ← SQRT(SRC[127:64])
DEST[MAX_VL-1:128] ← 0

SQRTPD (128-bit Legacy SSE version)

DEST[63:0] ← SQRT(SRC[63:0])
DEST[127:64] ← SQRT(SRC[127:64])
DEST[MAX_VL-1:128] (Unmodified)
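
The element loop of the EVEX encoded form, including merging-masking, zeroing-masking, and the 64-bit broadcast case, can be sketched in Python. This is an illustrative model only: registers are represented as lists of floats, the name `vsqrtpd_evex` and its parameters are hypothetical, rounding-mode selection is not modeled, and inputs that would raise the Invalid exception are simply modeled as NaN.

```python
import math


def vsqrtpd_evex(dest, src, k1=None, zeroing=False, broadcast=False):
    """Model the per-element behavior of EVEX-encoded VSQRTPD.

    dest: list of KL doubles (KL = 2, 4, or 8) holding the old destination.
    src:  list of KL doubles, or a single-element list when broadcast=True
          (standing in for the m64bcst memory form).
    k1:   integer writemask, or None for "no writemask".
    """
    kl = len(dest)
    if kl not in (2, 4, 8):
        raise ValueError("KL must be 2, 4, or 8")
    result = list(dest)
    for j in range(kl):
        if k1 is None or (k1 >> j) & 1:
            x = src[0] if broadcast else src[j]
            # Negative non-zero inputs (and NaN inputs) are modeled as NaN.
            result[j] = math.sqrt(x) if x >= 0 else math.nan
        elif zeroing:
            result[j] = 0.0  # zeroing-masking
        # Otherwise merging-masking: result[j] keeps the old destination value.
    return result
```

With `dest = [9.0, 9.0]`, `src = [4.0, 16.0]` and `k1 = 0b01`, the model leaves element 1 at 9.0 under merging-masking and clears it to 0.0 under zeroing-masking.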

Intel C/C++ Compiler Intrinsic Equivalent 

VSQRTPD __m512d _mm512_sqrt_round_pd(__m512d a, int r);

VSQRTPD __m512d _mm512_mask_sqrt_round_pd(__m512d s, __mmask8 k, __m512d a, int r);

VSQRTPD __m512d _mm512_maskz_sqrt_round_pd(__mmask8 k, __m512d a, int r);

VSQRTPD __m256d _mm256_sqrt_pd(__m256d a);

VSQRTPD __m256d _mm256_mask_sqrt_pd(__m256d s, __mmask8 k, __m256d a);

VSQRTPD __m256d _mm256_maskz_sqrt_pd(__mmask8 k, __m256d a);

SQRTPD __m128d _mm_sqrt_pd(__m128d a);

VSQRTPD __m128d _mm_mask_sqrt_pd(__m128d s, __mmask8 k, __m128d a);

VSQRTPD __m128d _mm_maskz_sqrt_pd(__mmask8 k, __m128d a);
SIMD Floating-Point Exceptions

Invalid, Precision, Denormal

Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 2; additionally
#UD               If VEX.vvvv != 1111B.

EVEX-encoded instruction, see Exceptions Type E2.
#UD               If EVEX.vvvv != 1111B.
SQRTPS—Square Root of Single-Precision Floating-Point Values

Opcode/Instruction                            Op/En  64/32-bit Mode Support  CPUID Feature Flag  Description
0F 51 /r                                      RM     V/V                     SSE                 Computes square roots of the packed single-precision floating-point values in xmm2/m128 and stores the result in xmm1.
SQRTPS xmm1, xmm2/m128
VEX.128.0F.WIG 51 /r                          RM     V/V                     AVX                 Computes square roots of the packed single-precision floating-point values in xmm2/m128 and stores the result in xmm1.
VSQRTPS xmm1, xmm2/m128
VEX.256.0F.WIG 51 /r                          RM     V/V                     AVX                 Computes square roots of the packed single-precision floating-point values in ymm2/m256 and stores the result in ymm1.
VSQRTPS ymm1, ymm2/m256
EVEX.128.0F.W0 51 /r                          FV     V/V                     AVX512VL AVX512F    Computes square roots of the packed single-precision floating-point values in xmm2/m128/m32bcst and stores the result in xmm1 subject to writemask k1.
VSQRTPS xmm1 {k1}{z}, xmm2/m128/m32bcst
EVEX.256.0F.W0 51 /r                          FV     V/V                     AVX512VL AVX512F    Computes square roots of the packed single-precision floating-point values in ymm2/m256/m32bcst and stores the result in ymm1 subject to writemask k1.
VSQRTPS ymm1 {k1}{z}, ymm2/m256/m32bcst
EVEX.512.0F.W0 51 /r                          FV     V/V                     AVX512F             Computes square roots of the packed single-precision floating-point values in zmm2/m512/m32bcst and stores the result in zmm1 subject to writemask k1.
VSQRTPS zmm1 {k1}{z}, zmm2/m512/m32bcst{er}


Instruction Operand Encoding 


Op/En  Operand 1      Operand 2      Operand 3  Operand 4
RM     ModRM:reg (w)  ModRM:r/m (r)  NA         NA
FV     ModRM:reg (w)  ModRM:r/m (r)  NA         NA


Description 

Performs a SIMD computation of the square roots of the four, eight, or sixteen packed single-precision
floating-point values in the source operand (second operand) and stores the packed single-precision floating-point
results in the destination operand.

EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location,
or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a
ZMM/YMM/XMM register updated according to the writemask.

VEX.256 encoded version: The source operand is a YMM register or a 256-bit memory location. The destination
operand is a YMM register. The upper bits (MAX_VL-1:256) of the corresponding ZMM register destination are
zeroed.

VEX.128 encoded version: The source operand is an XMM register or a 128-bit memory location. The destination
operand is an XMM register. The upper bits (MAX_VL-1:128) of the corresponding ZMM register destination are
zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or 128-bit memory location. The
destination is not distinct from the first source XMM register and the upper bits (MAX_VL-1:128) of the
corresponding ZMM register destination are unmodified.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD. 
Operation

VSQRTPS (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1) AND (SRC *is register*)
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC *is memory*)
            THEN DEST[i+31:i] ← SQRT(SRC[31:0])
            ELSE DEST[i+31:i] ← SQRT(SRC[i+31:i])
        FI;
    ELSE
        IF *merging-masking*                ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE                            ; zeroing-masking
                DEST[i+31:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VSQRTPS (VEX.256 encoded version)

DEST[31:0] ← SQRT(SRC[31:0])
DEST[63:32] ← SQRT(SRC[63:32])
DEST[95:64] ← SQRT(SRC[95:64])
DEST[127:96] ← SQRT(SRC[127:96])
DEST[159:128] ← SQRT(SRC[159:128])
DEST[191:160] ← SQRT(SRC[191:160])
DEST[223:192] ← SQRT(SRC[223:192])
DEST[255:224] ← SQRT(SRC[255:224])
DEST[MAX_VL-1:256] ← 0

VSQRTPS (VEX.128 encoded version)

DEST[31:0] ← SQRT(SRC[31:0])
DEST[63:32] ← SQRT(SRC[63:32])
DEST[95:64] ← SQRT(SRC[95:64])
DEST[127:96] ← SQRT(SRC[127:96])
DEST[MAX_VL-1:128] ← 0

SQRTPS (128-bit Legacy SSE version)

DEST[31:0] ← SQRT(SRC[31:0])
DEST[63:32] ← SQRT(SRC[63:32])
DEST[95:64] ← SQRT(SRC[95:64])
DEST[127:96] ← SQRT(SRC[127:96])
DEST[MAX_VL-1:128] (Unmodified)
Intel C/C++ Compiler Intrinsic Equivalent

VSQRTPS __m512 _mm512_sqrt_round_ps(__m512 a, int r);

VSQRTPS __m512 _mm512_mask_sqrt_round_ps(__m512 s, __mmask16 k, __m512 a, int r);

VSQRTPS __m512 _mm512_maskz_sqrt_round_ps(__mmask16 k, __m512 a, int r);

VSQRTPS __m256 _mm256_sqrt_ps(__m256 a);

VSQRTPS __m256 _mm256_mask_sqrt_ps(__m256 s, __mmask8 k, __m256 a);

VSQRTPS __m256 _mm256_maskz_sqrt_ps(__mmask8 k, __m256 a);

SQRTPS __m128 _mm_sqrt_ps(__m128 a);

VSQRTPS __m128 _mm_mask_sqrt_ps(__m128 s, __mmask8 k, __m128 a);

VSQRTPS __m128 _mm_maskz_sqrt_ps(__mmask8 k, __m128 a);

SIMD Floating-Point Exceptions 

Invalid, Precision, Denormal 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 2; additionally
#UD               If VEX.vvvv != 1111B.

EVEX-encoded instruction, see Exceptions Type E2.
#UD               If EVEX.vvvv != 1111B.
SQRTSD—Compute Square Root of Scalar Double-Precision Floating-Point Value

Opcode/Instruction                            Op/En  64/32-bit Mode Support  CPUID Feature Flag  Description
F2 0F 51 /r                                   RM     V/V                     SSE2                Computes square root of the low double-precision floating-point value in xmm2/m64 and stores the results in xmm1.
SQRTSD xmm1, xmm2/m64
VEX.NDS.128.F2.0F.WIG 51 /r                   RVM    V/V                     AVX                 Computes square root of the low double-precision floating-point value in xmm3/m64 and stores the results in xmm1. Also, upper double-precision floating-point value (bits[127:64]) from xmm2 is copied to xmm1[127:64].
VSQRTSD xmm1, xmm2, xmm3/m64
EVEX.NDS.LIG.F2.0F.W1 51 /r                   T1S    V/V                     AVX512F             Computes square root of the low double-precision floating-point value in xmm3/m64 and stores the results in xmm1 under writemask k1. Also, upper double-precision floating-point value (bits[127:64]) from xmm2 is copied to xmm1[127:64].
VSQRTSD xmm1 {k1}{z}, xmm2, xmm3/m64{er}


Instruction Operand Encoding 


Op/En  Operand 1      Operand 2      Operand 3      Operand 4
RM     ModRM:reg (w)  ModRM:r/m (r)  NA             NA
RVM    ModRM:reg (w)  VEX.vvvv (r)   ModRM:r/m (r)  NA
T1S    ModRM:reg (w)  EVEX.vvvv (r)  ModRM:r/m (r)  NA


Description 

Computes the square root of the low double-precision floating-point value in the second source operand and stores 
the double-precision floating-point result in the destination operand. The second source operand can be an XMM 
register or a 64-bit memory location. The first source and destination operands are XMM registers. 

128-bit Legacy SSE version: The first source operand and the destination operand are the same. The quadword at 
bits 127:64 of the destination operand remains unchanged. Bits (MAX_VL-1:64) of the corresponding destination 
register remain unchanged. 

VEX.128 and EVEX encoded versions: Bits 127:64 of the destination operand are copied from the corresponding 
bits of the first source operand. Bits (MAX_VL-1:128) of the destination register are zeroed. 

EVEX encoded version: The low quadword element of the destination operand is updated according to the 
writemask. 

Software should ensure VSQRTSD is encoded with VEX.L=0. Encoding VSQRTSD with VEX.L=1 may encounter 
unpredictable behavior across different processor generations. 
Operation

VSQRTSD (EVEX encoded version)

IF (EVEX.b = 1) AND (SRC2 *is register*)
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← SQRT(SRC2[63:0])
    ELSE
        IF *merging-masking*                ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE                            ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAX_VL-1:128] ← 0

VSQRTSD (VEX.128 encoded version)

DEST[63:0] ← SQRT(SRC2[63:0])
DEST[127:64] ← SRC1[127:64]
DEST[MAX_VL-1:128] ← 0

SQRTSD (128-bit Legacy SSE version)

DEST[63:0] ← SQRT(SRC[63:0])
DEST[MAX_VL-1:64] (Unmodified)
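
The way the scalar forms combine the low-element result with bits 127:64 of the first source can be sketched as follows. This is a simplified model under stated assumptions: each XMM register is a two-element list `[low, high]` of doubles, the function name `vsqrtsd` and its parameters are hypothetical, and rounding control and exception flags are not modeled.

```python
import math


def vsqrtsd(src1, src2, dest=None, k1=None, zeroing=False):
    """Model VSQRTSD (VEX.128 and EVEX forms) on [low, high] pairs of doubles.

    src1 supplies bits 127:64 of the result; src2 supplies the value whose
    square root is taken. dest is the old destination, needed only for
    merging-masking. k1 is an integer writemask, or None for no writemask.
    """
    if k1 is None or k1 & 1:
        x = src2[0]
        low = math.sqrt(x) if x >= 0 else math.nan  # NaN stands in for Invalid
    elif zeroing:
        low = 0.0  # zeroing-masking
    else:
        if dest is None:
            raise ValueError("merging-masking needs the old destination")
        low = dest[0]  # merging-masking keeps the old low element
    # The upper quadword always comes from the first source operand.
    return [low, src1[1]]
```

Calling `vsqrtsd([1.0, 7.0], [81.0, 3.0])` gives `[9.0, 7.0]`: the high element of the second source (3.0) never reaches the destination.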

Intel C/C++ Compiler Intrinsic Equivalent 

VSQRTSD __m128d _mm_sqrt_round_sd(__m128d a, __m128d b, int r);

VSQRTSD __m128d _mm_mask_sqrt_round_sd(__m128d s, __mmask8 k, __m128d a, __m128d b, int r);

VSQRTSD __m128d _mm_maskz_sqrt_round_sd(__mmask8 k, __m128d a, __m128d b, int r);

SQRTSD __m128d _mm_sqrt_sd(__m128d a, __m128d b);

SIMD Floating-Point Exceptions 

Invalid, Precision, Denormal 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 3. 

EVEX-encoded instruction, see Exceptions Type E3. 
SQRTSS—Compute Square Root of Scalar Single-Precision Value

Opcode/Instruction                            Op/En  64/32-bit Mode Support  CPUID Feature Flag  Description
F3 0F 51 /r                                   RM     V/V                     SSE                 Computes square root of the low single-precision floating-point value in xmm2/m32 and stores the results in xmm1.
SQRTSS xmm1, xmm2/m32
VEX.NDS.128.F3.0F.WIG 51 /r                   RVM    V/V                     AVX                 Computes square root of the low single-precision floating-point value in xmm3/m32 and stores the results in xmm1. Also, upper single-precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32].
VSQRTSS xmm1, xmm2, xmm3/m32
EVEX.NDS.LIG.F3.0F.W0 51 /r                   T1S    V/V                     AVX512F             Computes square root of the low single-precision floating-point value in xmm3/m32 and stores the results in xmm1 under writemask k1. Also, upper single-precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32].
VSQRTSS xmm1 {k1}{z}, xmm2, xmm3/m32{er}


Instruction Operand Encoding 


Op/En  Operand 1      Operand 2      Operand 3      Operand 4
RM     ModRM:reg (w)  ModRM:r/m (r)  NA             NA
RVM    ModRM:reg (w)  VEX.vvvv (r)   ModRM:r/m (r)  NA
T1S    ModRM:reg (w)  EVEX.vvvv (r)  ModRM:r/m (r)  NA


Description 

Computes the square root of the low single-precision floating-point value in the second source operand and stores
the single-precision floating-point result in the destination operand. The second source operand can be an XMM
register or a 32-bit memory location. The first source and destination operands are XMM registers.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits
(MAX_VL-1:32) of the corresponding YMM destination register remain unchanged.

VEX.128 and EVEX encoded versions: Bits 127:32 of the destination operand are copied from the corresponding
bits of the first source operand. Bits (MAX_VL-1:128) of the destination ZMM register are zeroed.

EVEX encoded version: The low doubleword element of the destination operand is updated according to the 
writemask. 

Software should ensure VSQRTSS is encoded with VEX.L=0. Encoding VSQRTSS with VEX.L=1 may encounter 
unpredictable behavior across different processor generations. 
Operation

VSQRTSS (EVEX encoded version)

IF (EVEX.b = 1) AND (SRC2 *is register*)
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← SQRT(SRC2[31:0])
    ELSE
        IF *merging-masking*                ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE                            ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAX_VL-1:128] ← 0

VSQRTSS (VEX.128 encoded version)

DEST[31:0] ← SQRT(SRC2[31:0])
DEST[127:32] ← SRC1[127:32]
DEST[MAX_VL-1:128] ← 0

SQRTSS (128-bit Legacy SSE version)

DEST[31:0] ← SQRT(SRC2[31:0])
DEST[MAX_VL-1:32] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent 

VSQRTSS __m128 _mm_sqrt_round_ss(__m128 a, __m128 b, int r);

VSQRTSS __m128 _mm_mask_sqrt_round_ss(__m128 s, __mmask8 k, __m128 a, __m128 b, int r);

VSQRTSS __m128 _mm_maskz_sqrt_round_ss(__mmask8 k, __m128 a, __m128 b, int r);

SQRTSS __m128 _mm_sqrt_ss(__m128 a);

SIMD Floating-Point Exceptions 

Invalid, Precision, Denormal 

Other Exceptions 

Non-EVEX-encoded instruction, see Exceptions Type 3. 

EVEX-encoded instruction, see Exceptions Type E3. 
STAC—Set AC Flag in EFLAGS Register

Opcode/Instruction  Op/En  64/32-bit Mode Support  CPUID Feature Flag  Description
0F 01 CB            NP     V/V                     SMAP                Set the AC flag in the EFLAGS register.
STAC

Instruction Operand Encoding 


Op/En  Operand 1  Operand 2  Operand 3  Operand 4
NP     NA         NA         NA         NA


Description 

Sets the AC flag bit in the EFLAGS register. This may enable alignment checking of user-mode data accesses. This
allows explicit supervisor-mode data accesses to user-mode pages even if the SMAP bit is set in the CR4 register.

This instruction's operation is the same in non-64-bit modes and 64-bit mode. Attempts to execute STAC when 
CPL > 0 cause #UD. 

Operation 

EFLAGS.AC ← 1;
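
As a rough illustration of the operation together with the #UD conditions listed for protected mode, the following Python sketch treats EFLAGS as an integer. The names `stac`, `UndefinedOpcode`, and the parameters are hypothetical; virtual-8086 mode, where the instruction is not recognized at all, is not modeled.

```python
EFLAGS_AC = 1 << 18  # the AC flag is bit 18 of EFLAGS


class UndefinedOpcode(Exception):
    """Stands in for the #UD exception in this model."""


def stac(eflags, cpl, smap_supported=True, lock_prefix=False):
    """Return the new EFLAGS value after STAC, or raise UndefinedOpcode.

    smap_supported models CPUID.(EAX=07H, ECX=0H):EBX.SMAP[bit 20].
    """
    if lock_prefix or cpl > 0 or not smap_supported:
        raise UndefinedOpcode("#UD")
    # Only AC changes; every other flag bit is passed through unchanged.
    return eflags | EFLAGS_AC
```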

Flags Affected 

AC set. Other flags are unaffected. 

Protected Mode Exceptions

#UD               If the LOCK prefix is used.
                  If the CPL > 0.
                  If CPUID.(EAX=07H, ECX=0H):EBX.SMAP[bit 20] = 0.

Real-Address Mode Exceptions

#UD               If the LOCK prefix is used.
                  If CPUID.(EAX=07H, ECX=0H):EBX.SMAP[bit 20] = 0.

Virtual-8086 Mode Exceptions

#UD               The STAC instruction is not recognized in virtual-8086 mode.

Compatibility Mode Exceptions

#UD               If the LOCK prefix is used.
                  If the CPL > 0.
                  If CPUID.(EAX=07H, ECX=0H):EBX.SMAP[bit 20] = 0.

64-Bit Mode Exceptions

#UD               If the LOCK prefix is used.
                  If the CPL > 0.
                  If CPUID.(EAX=07H, ECX=0H):EBX.SMAP[bit 20] = 0.
STC—Set Carry Flag

Opcode  Instruction  Op/En  64-Bit Mode  Compat/Leg Mode  Description
F9      STC          NP     Valid        Valid            Set CF flag.


Instruction Operand Encoding 


Op/En  Operand 1  Operand 2  Operand 3  Operand 4
NP     NA         NA         NA         NA


Description 

Sets the CF flag in the EFLAGS register. Operation is the same in all modes. 

Operation 

CF ← 1;

Flags Affected 

The CF flag is set. The OF, ZF, SF, AF, and PF flags are unaffected. 

Exceptions (All Operating Modes) 

#UD If the LOCK prefix is used. 


STD—Set Direction Flag

Opcode  Instruction  Op/En  64-Bit Mode  Compat/Leg Mode  Description
FD      STD          NP     Valid        Valid            Set DF flag.


Instruction Operand Encoding 


Op/En  Operand 1  Operand 2  Operand 3  Operand 4
NP     NA         NA         NA         NA


Description 

Sets the DF flag in the EFLAGS register. When the DF flag is set to 1, string operations decrement the index
registers (ESI and/or EDI). Operation is the same in all modes.

Operation 

DF ← 1;

Flags Affected 

The DF flag is set. The CF, OF, ZF, SF, AF, and PF flags are unaffected. 

Exceptions (All Operating Modes) 

#UD If the LOCK prefix is used. 
STI—Set Interrupt Flag

Opcode  Instruction  Op/En  64-Bit Mode  Compat/Leg Mode  Description
FB      STI          NP     Valid        Valid            Set interrupt flag; external, maskable interrupts enabled at the end of the next instruction.


Instruction Operand Encoding 


Op/En  Operand 1  Operand 2  Operand 3  Operand 4
NP     NA         NA         NA         NA


Description 

If protected-mode virtual interrupts are not enabled, STI sets the interrupt flag (IF) in the EFLAGS register. After
the IF flag is set, the processor begins responding to external, maskable interrupts after the next instruction is
executed. The delayed effect of this instruction is provided to allow interrupts to be enabled just before returning
from a procedure (or subroutine). For instance, if an STI instruction is followed by an RET instruction, the RET
instruction is allowed to execute before external interrupts are recognized¹. If the STI instruction is followed by a
CLI instruction (which clears the IF flag), the effect of the STI instruction is negated.

The IF flag and the STI and CLI instructions do not prohibit the generation of exceptions and NMI interrupts. NMI
interrupts (and SMIs) may be blocked for one macroinstruction following an STI.

When protected-mode virtual interrupts are enabled, CPL is 3, and IOPL is less than 3, STI sets the VIF flag in the
EFLAGS register, leaving IF unaffected.

Table 4-19 indicates the action of the STI instruction depending on the processor's mode of operation and the 
CPL/IOPL settings of the running program or procedure. 

Operation is the same in all modes. 


Table 4-19. Decision Table for STI Results

CR0.PE  EFLAGS.VM  EFLAGS.IOPL  CS.CPL  CR4.PVI  EFLAGS.VIP  CR4.VME  STI Result
0       X          X            X       X        X           X        IF = 1
1       0          ≥CPL         X       X        X           X        IF = 1
1       0          <CPL         3       1        X           X        VIF = 1
1       0          <CPL         <3      X        X           X        GP Fault
1       0          <CPL         X       0        X           X        GP Fault
1       0          <CPL         X       X        1           X        GP Fault
1       1          3            X       X        X           X        IF = 1
1       1          <3           X       X        0           1        VIF = 1
1       1          <3           X       X        1           X        GP Fault
1       1          <3           X       X        X           0        GP Fault


NOTES:

X = This setting has no impact.

1. The STI instruction delays recognition of interrupts only if it is executed with EFLAGS.IF = 0. In a sequence of
STI instructions, only the first instruction in the sequence is guaranteed to delay interrupts.

In the following instruction sequence, interrupts may be recognized before RET executes:

STI
STI
RET
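
The one-instruction delay described in the note can be illustrated with a toy simulation. This is a simplified sketch, not a description of processor internals: it assumes an external maskable interrupt is pending throughout, models only the IF flag, and uses a hypothetical function name `first_interrupt_boundary` with mnemonic strings as input.

```python
def first_interrupt_boundary(instructions, if_flag=0):
    """Return the index of the first instruction after which a pending
    external maskable interrupt may be recognized, or None if never.

    instructions is a list of mnemonic strings such as "STI", "CLI", "RET".
    """
    for index, insn in enumerate(instructions):
        delayed = False
        if insn == "STI":
            # Only an STI executed with IF = 0 delays recognition.
            delayed = (if_flag == 0)
            if_flag = 1
        elif insn == "CLI":
            if_flag = 0
        # The delay covers only the boundary right after that STI.
        if if_flag == 1 and not delayed:
            return index
    return None
```

For `["STI", "RET"]` starting with IF = 0 the model returns 1 (after RET), while for `["STI", "STI", "RET"]` it returns 1 as well, which is the boundary before RET, matching the note. For `["STI", "CLI"]` it returns `None`, since CLI negates the effect of STI.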


Operation

IF PE = 0 (* Executing in real-address mode *)
    THEN
        IF ← 1; (* Set Interrupt Flag *)
    ELSE (* Executing in protected mode or virtual-8086 mode *)
        IF VM = 0 (* Executing in protected mode *)
            THEN
                IF IOPL ≥ CPL
                    THEN
                        IF ← 1; (* Set Interrupt Flag *)
                    ELSE
                        IF (IOPL < CPL) and (CPL = 3) and (PVI = 1)
                            THEN
                                VIF ← 1; (* Set Virtual Interrupt Flag *)
                            ELSE
                                #GP(0);
                        FI;
                FI;
            ELSE (* Executing in virtual-8086 mode *)
                IF IOPL = 3
                    THEN
                        IF ← 1; (* Set Interrupt Flag *)
                    ELSE
                        IF ((IOPL < 3) and (VIP = 0) and (VME = 1))
                            THEN
                                VIF ← 1; (* Set Virtual Interrupt Flag *)
                            ELSE
                                #GP(0); (* Trap to virtual-8086 monitor *)
                        FI;
                FI;
        FI;
FI;
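
The decision logic can also be written as a small Python function. This is a sketch under stated assumptions: the function name `sti_result` and its string return values are hypothetical, and it follows the rows of Table 4-19, which additionally require EFLAGS.VIP = 0 before VIF is set in protected mode (the Operation pseudocode as printed does not test VIP on that path).

```python
def sti_result(pe, vm, iopl, cpl, pvi, vip, vme):
    """Return "IF = 1", "VIF = 1", or "#GP(0)" for the given processor state.

    Arguments are small integers mirroring CR0.PE, EFLAGS.VM, EFLAGS.IOPL,
    CS.CPL, CR4.PVI, EFLAGS.VIP, and CR4.VME.
    """
    if pe == 0:
        return "IF = 1"  # real-address mode
    if vm == 0:
        # Protected mode.
        if iopl >= cpl:
            return "IF = 1"
        if cpl == 3 and pvi == 1 and vip == 0:
            return "VIF = 1"
        return "#GP(0)"
    # Virtual-8086 mode.
    if iopl == 3:
        return "IF = 1"
    if vip == 0 and vme == 1:
        return "VIF = 1"
    return "#GP(0)"
```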


Flags Affected 

The IF flag is set to 1; or the VIF flag is set to 1. Other flags are unaffected. 

Protected Mode Exceptions

#GP(0)            If the CPL is greater (has less privilege) than the IOPL of the current program or procedure.
#UD               If the LOCK prefix is used.

Real-Address Mode Exceptions

#UD               If the LOCK prefix is used.

Virtual-8086 Mode Exceptions

Same exceptions as in protected mode.

Compatibility Mode Exceptions

Same exceptions as in protected mode.

64-Bit Mode Exceptions

Same exceptions as in protected mode.
STMXCSR—Store MXCSR Register State

Opcode*/Instruction  Op/En  64/32-bit Mode Support  CPUID Feature Flag  Description
0F AE /3             M      V/V                     SSE                 Store contents of MXCSR register to m32.
STMXCSR m32
VEX.LZ.0F.WIG AE /3  M      V/V                     AVX                 Store contents of MXCSR register to m32.
VSTMXCSR m32


Instruction Operand Encoding 


Op/En  Operand 1      Operand 2  Operand 3  Operand 4
M      ModRM:r/m (w)  NA         NA         NA


Description 

Stores the contents of the MXCSR control and status register to the destination operand. The destination operand
is a 32-bit memory location. The reserved bits in the MXCSR register are stored as 0s.

This instruction's operation is the same in non-64-bit modes and 64-bit mode. 

VEX.L must be 0, otherwise instructions will #UD. 

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD. 

Operation 

m32 ← MXCSR;

Intel C/C++ Compiler Intrinsic Equivalent 

_mm_getcsr(void)

SIMD Floating-Point Exceptions 

None. 

Other Exceptions 

See Exceptions Type 5; additionally
#UD               If VEX.L = 1.
                  If VEX.vvvv ≠ 1111B.
STOS/STOSB/STOSW/STOSD/STOSQ-Store String

Opcode      Instruction  Op/En  64-Bit Mode  Compat/Leg Mode  Description
AA          STOS m8      NA     Valid        Valid            For legacy mode, store AL at address ES:(E)DI; for 64-bit mode, store AL at address RDI or EDI.
AB          STOS m16     NA     Valid        Valid            For legacy mode, store AX at address ES:(E)DI; for 64-bit mode, store AX at address RDI or EDI.
AB          STOS m32     NA     Valid        Valid            For legacy mode, store EAX at address ES:(E)DI; for 64-bit mode, store EAX at address RDI or EDI.
REX.W + AB  STOS m64     NA     Valid        N.E.             Store RAX at address RDI or EDI.
AA          STOSB        NA     Valid        Valid            For legacy mode, store AL at address ES:(E)DI; for 64-bit mode, store AL at address RDI or EDI.
AB          STOSW        NA     Valid        Valid            For legacy mode, store AX at address ES:(E)DI; for 64-bit mode, store AX at address RDI or EDI.
AB          STOSD        NA     Valid        Valid            For legacy mode, store EAX at address ES:(E)DI; for 64-bit mode, store EAX at address RDI or EDI.
REX.W + AB  STOSQ        NA     Valid        N.E.             Store RAX at address RDI or EDI.


Instruction Operand Encoding 


Op/En  Operand 1  Operand 2  Operand 3  Operand 4
NA     NA         NA         NA         NA


Description 

In non-64-bit and default 64-bit mode; stores a byte, word, or doubleword from the AL, AX, or EAX register 
(respectively) into the destination operand. The destination operand is a memory location, the address of which is 
read from either the ES:EDI or ES:DI register (depending on the address-size attribute of the instruction and the 
mode of operation). The ES segment cannot be overridden with a segment override prefix. 

At the assembly-code level, two forms of the instruction are allowed: the "explicit-operands" form and the
"no-operands" form. The explicit-operands form (specified with the STOS mnemonic) allows the destination operand to
be specified explicitly. Here, the destination operand should be a symbol that indicates the size and location of the 
destination value. The source operand is then automatically selected to match the size of the destination operand 
(the AL register for byte operands, AX for word operands, EAX for doubleword operands). The explicit-operands 
form is provided to allow documentation; however, note that the documentation provided by this form can be 
misleading. That is, the destination operand symbol must specify the correct type (size) of the operand (byte, 
word, or doubleword), but it does not have to specify the correct location. The location is always specified by the 
ES:(E)DI register. These must be loaded correctly before the store string instruction is executed. 

The no-operands form provides "short forms" of the byte, word, doubleword, and quadword versions of the STOS 
instructions. Here also ES:(E)DI is assumed to be the destination operand and AL, AX, or EAX is assumed to be the 
source operand. The size of the destination and source operands is selected by the mnemonic: STOSB (byte read 
from register AL), STOSW (word from AX), STOSD (doubleword from EAX). 

After the byte, word, or doubleword is transferred from the register to the memory location, the (E)DI register is 
incremented or decremented according to the setting of the DF flag in the EFLAGS register. If the DF flag is 0, the 
register is incremented; if the DF flag is 1, the register is decremented (the register is incremented or decremented 
by 1 for byte operations, by 2 for word operations, by 4 for doubleword operations). 



NOTE: To improve performance, more recent processors support modifications to the processor's operation during
the string store operations initiated with STOS and STOSB. See Section 7.3.9.3 in the Intel® 64 and IA-32
Architectures Software Developer's Manual, Volume 1 for additional information on fast-string operation.

In 64-bit mode, the default address size is 64 bits; a 32-bit address size is supported using the prefix 67H. Using a
REX prefix in the form of REX.W promotes operation on doubleword operands to 64 bits. The promoted no-operand
mnemonic is STOSQ. STOSQ (and its explicit-operands variant) stores a quadword from the RAX register into the
destination addressed by RDI or EDI. See the summary chart at the beginning of this section for encoding data and
limits.

The STOS, STOSB, STOSW, STOSD, STOSQ instructions can be preceded by the REP prefix for block stores of ECX
bytes, words, or doublewords. More often, however, these instructions are used within a LOOP construct because
data needs to be moved into the AL, AX, or EAX register before it can be stored. See "REP/REPE/REPZ/REPNE/REPNZ—Repeat
String Operation Prefix" in this chapter for a description of the REP prefix.
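The store-and-step behavior described above, including the REP form, can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the `mem` bytearray, the `regs` dictionary and its keys (`ax`, `di`, `cx`) are hypothetical names, and segment checks, address-size wraparound and faults are not modeled.

```python
def stos(mem, regs, size, df=0):
    """Store the low `size` bytes of the accumulator at mem[DI], then step DI.

    size is 1, 2, 4 or 8 (AL, AX, EAX or RAX); df mirrors the DF flag.
    """
    value = regs["ax"] & ((1 << (8 * size)) - 1)
    di = regs["di"]
    mem[di:di + size] = value.to_bytes(size, "little")
    # DF = 0 increments the destination index, DF = 1 decrements it.
    regs["di"] = di - size if df else di + size


def rep_stos(mem, regs, size, df=0):
    """Model of the REP prefix: repeat STOS until the count register is zero."""
    while regs["cx"] != 0:
        stos(mem, regs, size, df)
        regs["cx"] -= 1


mem = bytearray(16)
regs = {"ax": 0x11223344, "di": 4, "cx": 2}
rep_stos(mem, regs, size=4)  # two doubleword stores starting at offset 4
```

After the call, the destination index has advanced by 8 and the count register is zero, which matches the description of a block store of two doublewords.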

Operation 

Non-64-bit Mode:

IF (Byte store)
    THEN
        DEST ← AL;
        IF DF = 0
            THEN (E)DI ← (E)DI + 1;
            ELSE (E)DI ← (E)DI - 1;
        FI;
ELSE IF (Word store)
    THEN
        DEST ← AX;
        IF DF = 0
            THEN (E)DI ← (E)DI + 2;
            ELSE (E)DI ← (E)DI - 2;
        FI;
ELSE IF (Doubleword store)
    THEN
        DEST ← EAX;
        IF DF = 0
            THEN (E)DI ← (E)DI + 4;
            ELSE (E)DI ← (E)DI - 4;
        FI;
FI;

64-bit Mode:

IF (Byte store)
    THEN
        DEST ← AL;
        IF DF = 0
            THEN (R|E)DI ← (R|E)DI + 1;
            ELSE (R|E)DI ← (R|E)DI - 1;
        FI;
ELSE IF (Word store)
    THEN
        DEST ← AX;
        IF DF = 0
            THEN (R|E)DI ← (R|E)DI + 2;
            ELSE (R|E)DI ← (R|E)DI - 2;
        FI;
ELSE IF (Doubleword store)
    THEN
        DEST ← EAX;
        IF DF = 0
            THEN (R|E)DI ← (R|E)DI + 4;
            ELSE (R|E)DI ← (R|E)DI - 4;
        FI;
ELSE IF (Quadword store using REX.W)
    THEN
        DEST ← RAX;
        IF DF = 0
            THEN (R|E)DI ← (R|E)DI + 8;
            ELSE (R|E)DI ← (R|E)DI - 8;
        FI;
FI;

Flags Affected 

None. 

Protected Mode Exceptions 

#GP(0) If the destination is located in a non-writable segment. 

If a memory operand effective address is outside the limit of the ES segment. 

If the ES register contains a NULL segment selector. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
current privilege level is 3.

#UD If the LOCK prefix is used. 

Real-Address Mode Exceptions 

#GP If a memory operand effective address is outside the ES segment limit. 

#UD If the LOCK prefix is used. 

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the ES segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made. 

#UD If the LOCK prefix is used. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions

#GP(0) If the memory address is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
current privilege level is 3.

#UD If the LOCK prefix is used.

STR—Store Task Register

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
0F 00 /1 | STR r/m16 | M | Valid | Valid | Stores segment selector from TR in r/m16.

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
M | ModRM:r/m (w) | NA | NA | NA

Description 

Stores the segment selector from the task register (TR) in the destination operand. The destination operand can be 
a general-purpose register or a memory location. The segment selector stored with this instruction points to the 
task state segment (TSS) for the currently running task. 

When the destination operand is a 32-bit register, the 16-bit segment selector is copied into the lower 16 bits of the 
register and the upper 16 bits of the register are cleared. When the destination operand is a memory location, the 
segment selector is written to memory as a 16-bit quantity, regardless of operand size. 

In 64-bit mode, operation is the same. The size of the memory operand is fixed at 16 bits. In register stores, the
2-byte TR is zero extended if stored to a 64-bit register.

The STR instruction is useful only in operating-system software. It can only be executed in protected mode. 

Operation 

DEST ← TR(SegmentSelector);

Flags Affected 

None. 

Protected Mode Exceptions 

#GP(0) If the destination is a memory operand that is located in a non-writable segment or if the
effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register is used to access memory and it contains a NULL segment
selector.

If CR4.UMIP = 1 and CPL > 0.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
current privilege level is 3.

#UD If the LOCK prefix is used.

Real-Address Mode Exceptions 

#UD The STR instruction is not recognized in real-address mode. 

Virtual-8086 Mode Exceptions

#UD The STR instruction is not recognized in virtual-8086 mode. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions

#GP(0) If the memory address is in a non-canonical form.

If CR4.UMIP = 1 and CPL > 0.

#SS(0) If the stack address is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
current privilege level is 3.

#UD If the LOCK prefix is used.

SUB—Subtract

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
2C ib | SUB AL, imm8 | I | Valid | Valid | Subtract imm8 from AL.
2D iw | SUB AX, imm16 | I | Valid | Valid | Subtract imm16 from AX.
2D id | SUB EAX, imm32 | I | Valid | Valid | Subtract imm32 from EAX.
REX.W + 2D id | SUB RAX, imm32 | I | Valid | N.E. | Subtract imm32 sign-extended to 64-bits from RAX.
80 /5 ib | SUB r/m8, imm8 | MI | Valid | Valid | Subtract imm8 from r/m8.
REX + 80 /5 ib | SUB r/m8*, imm8 | MI | Valid | N.E. | Subtract imm8 from r/m8.
81 /5 iw | SUB r/m16, imm16 | MI | Valid | Valid | Subtract imm16 from r/m16.
81 /5 id | SUB r/m32, imm32 | MI | Valid | Valid | Subtract imm32 from r/m32.
REX.W + 81 /5 id | SUB r/m64, imm32 | MI | Valid | N.E. | Subtract imm32 sign-extended to 64-bits from r/m64.
83 /5 ib | SUB r/m16, imm8 | MI | Valid | Valid | Subtract sign-extended imm8 from r/m16.
83 /5 ib | SUB r/m32, imm8 | MI | Valid | Valid | Subtract sign-extended imm8 from r/m32.
REX.W + 83 /5 ib | SUB r/m64, imm8 | MI | Valid | N.E. | Subtract sign-extended imm8 from r/m64.
28 /r | SUB r/m8, r8 | MR | Valid | Valid | Subtract r8 from r/m8.
REX + 28 /r | SUB r/m8*, r8* | MR | Valid | N.E. | Subtract r8 from r/m8.
29 /r | SUB r/m16, r16 | MR | Valid | Valid | Subtract r16 from r/m16.
29 /r | SUB r/m32, r32 | MR | Valid | Valid | Subtract r32 from r/m32.
REX.W + 29 /r | SUB r/m64, r64 | MR | Valid | N.E. | Subtract r64 from r/m64.
2A /r | SUB r8, r/m8 | RM | Valid | Valid | Subtract r/m8 from r8.
REX + 2A /r | SUB r8*, r/m8* | RM | Valid | N.E. | Subtract r/m8 from r8.
2B /r | SUB r16, r/m16 | RM | Valid | Valid | Subtract r/m16 from r16.
2B /r | SUB r32, r/m32 | RM | Valid | Valid | Subtract r/m32 from r32.
REX.W + 2B /r | SUB r64, r/m64 | RM | Valid | N.E. | Subtract r/m64 from r64.

NOTES: 

* In 64-bit mode, r/m8 can not be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.


Instruction Operand Encoding 


Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
I | AL/AX/EAX/RAX | imm8/16/32 | NA | NA
MI | ModRM:r/m (r, w) | imm8/16/32 | NA | NA
MR | ModRM:r/m (r, w) | ModRM:reg (r) | NA | NA
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA


Description 

Subtracts the second operand (source operand) from the first operand (destination operand) and stores the result 
in the destination operand. The destination operand can be a register or a memory location; the source operand 
can be an immediate, register, or memory location. (However, two memory operands cannot be used in one 
instruction.) When an immediate value is used as an operand, it is sign-extended to the length of the destination 
operand format. 

The SUB instruction performs integer subtraction. It evaluates the result for both signed and unsigned integer
operands and sets the OF and CF flags to indicate an overflow in the signed or unsigned result, respectively. The SF
flag indicates the sign of the signed result.
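As a rough illustration of how one subtraction yields both the signed and the unsigned overflow indications, the following Python sketch computes the result and the six status flags for a given operand width. The helper names `sign_extend` and `sub_flags` are hypothetical and chosen only for this example; the flag formulas are the conventional definitions (borrow out of the top bit for CF, borrow out of bit 3 for AF, even parity of the low byte for PF).

```python
def sign_extend(value, from_bits, to_bits):
    """Sign-extend an immediate of from_bits to to_bits (unsigned encoding)."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):
        value -= 1 << from_bits
    return value & ((1 << to_bits) - 1)


def sub_flags(dest, src, bits=32):
    """Return (result, flags) for DEST - SRC at the given operand width."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    dest &= mask
    src &= mask
    result = (dest - src) & mask
    flags = {
        "CF": int(dest < src),                                   # unsigned borrow
        "OF": int(((dest ^ src) & (dest ^ result) & sign) != 0),  # signed overflow
        "SF": int((result & sign) != 0),
        "ZF": int(result == 0),
        "AF": int(((dest ^ src ^ result) & 0x10) != 0),          # borrow from bit 4
        "PF": int(bin(result & 0xFF).count("1") % 2 == 0),       # even parity, low byte
    }
    return result, flags


# 0x80 - 1 at 8 bits: no unsigned borrow, but the signed result overflows.
result, flags = sub_flags(0x80, 0x01, bits=8)
```

The same inputs read as unsigned (128 - 1 = 127) are in range, so CF stays clear, while read as signed (-128 - 1) they are not representable in 8 bits, so OF is set.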

In 64-bit mode, the instruction's default operation size is 32 bits. Using a REX prefix in the form of REX.R permits 
access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. See 
the summary chart at the beginning of this section for encoding data and limits. 

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. 

Operation 

DEST ← (DEST - SRC);

Flags Affected 

The OF, SF, ZF, AF, PF, and CF flags are set according to the result. 

Protected Mode Exceptions 

#GP(0) If the destination is located in a non-writable segment. 

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

If the DS, ES, FS, or GS register contains a NULL segment selector. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
current privilege level is 3.

#UD If the LOCK prefix is used but the destination is not a memory operand. 

Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If the LOCK prefix is used but the destination is not a memory operand.

Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made.

#UD If the LOCK prefix is used but the destination is not a memory operand.

Compatibility Mode Exceptions

Same exceptions as in protected mode.

64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the memory address is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the
current privilege level is 3.

#UD If the LOCK prefix is used but the destination is not a memory operand.

SUBPD—Subtract Packed Double-Precision Floating-Point Values

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 5C /r | SUBPD xmm1, xmm2/m128 | RM | V/V | SSE2 | Subtract packed double-precision floating-point values in xmm2/mem from xmm1 and store result in xmm1.
VEX.NDS.128.66.0F.WIG 5C /r | VSUBPD xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Subtract packed double-precision floating-point values in xmm3/mem from xmm2 and store result in xmm1.
VEX.NDS.256.66.0F.WIG 5C /r | VSUBPD ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX | Subtract packed double-precision floating-point values in ymm3/mem from ymm2 and store result in ymm1.
EVEX.NDS.128.66.0F.W1 5C /r | VSUBPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | FV | V/V | AVX512VL AVX512F | Subtract packed double-precision floating-point values in xmm3/m128/m64bcst from xmm2 and store result in xmm1 with writemask k1.
EVEX.NDS.256.66.0F.W1 5C /r | VSUBPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | FV | V/V | AVX512VL AVX512F | Subtract packed double-precision floating-point values in ymm3/m256/m64bcst from ymm2 and store result in ymm1 with writemask k1.
EVEX.NDS.512.66.0F.W1 5C /r | VSUBPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er} | FV | V/V | AVX512F | Subtract packed double-precision floating-point values in zmm3/m512/m64bcst from zmm2 and store result in zmm1 with writemask k1.

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Performs a SIMD subtract of the two, four or eight packed double-precision floating-point values of the second
source operand from the first source operand, and stores the packed double-precision floating-point results in the
destination operand.

VEX.128 and EVEX.128 encoded versions: The second source operand is an XMM register or a 128-bit memory
location. The first source operand and destination operands are XMM registers. Bits (MAX_VL-1:128) of the
corresponding destination register are zeroed.

VEX.256 and EVEX.256 encoded versions: The second source operand is a YMM register or a 256-bit memory
location. The first source operand and destination operands are YMM registers. Bits (MAX_VL-1:256) of the
corresponding destination register are zeroed.

EVEX.512 encoded version: The second source operand is a ZMM register, a 512-bit memory location or a 512-bit
vector broadcasted from a 64-bit memory location. The first source operand and destination operands are ZMM
registers. The destination operand is conditionally updated according to the writemask.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The
destination is not distinct from the first source XMM register and the upper bits (MAX_VL-1:128) of the corresponding
destination register are unmodified.
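The difference in upper-bit handling between the legacy SSE form and the VEX/EVEX forms can be sketched as follows. This is a simplified model, assuming a register image held as a Python integer of MAX_VL = 512 bits; the function name `write_vector` and the `encoding` strings are hypothetical and exist only for this illustration.

```python
MAX_VL = 512  # assumed maximum vector length for this sketch


def write_vector(old_dest, result, vl, encoding):
    """Merge a VL-bit result into a MAX_VL-bit destination register image.

    encoding == "legacy": bits MAX_VL-1:VL keep their previous contents.
    any other encoding (VEX or EVEX): bits MAX_VL-1:VL are zeroed.
    """
    low_mask = (1 << vl) - 1
    full_mask = (1 << MAX_VL) - 1
    result &= low_mask
    if encoding == "legacy":
        return (old_dest & full_mask & ~low_mask) | result
    return result


all_ones = (1 << MAX_VL) - 1
legacy = write_vector(all_ones, 0, 128, "legacy")  # upper 384 bits preserved
vex = write_vector(all_ones, 0, 128, "vex")        # upper 384 bits cleared
```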

Operation

VSUBPD (EVEX encoded versions) when src2 operand is a vector register

(KL, VL) = (2, 128), (4, 256), (8, 512)

IF (VL = 512) AND (EVEX.b = 1)
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC1[i+63:i] - SRC2[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI;
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VSUBPD (EVEX encoded versions) when src2 operand is a memory source

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+63:i] ← SRC1[i+63:i] - SRC2[63:0];
                ELSE DEST[i+63:i] ← SRC1[i+63:i] - SRC2[i+63:i];
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI;
    FI;
ENDFOR
DEST[MAX_VL-1:VL] ← 0

VSUBPD (VEX.256 encoded version)

DEST[63:0] ← SRC1[63:0] - SRC2[63:0]
DEST[127:64] ← SRC1[127:64] - SRC2[127:64]
DEST[191:128] ← SRC1[191:128] - SRC2[191:128]
DEST[255:192] ← SRC1[255:192] - SRC2[255:192]
DEST[MAX_VL-1:256] ← 0

VSUBPD (VEX.128 encoded version)

DEST[63:0] ← SRC1[63:0] - SRC2[63:0]
DEST[127:64] ← SRC1[127:64] - SRC2[127:64]
DEST[MAX_VL-1:128] ← 0

SUBPD (128-bit Legacy SSE version)

DEST[63:0] ← DEST[63:0] - SRC[63:0]
DEST[127:64] ← DEST[127:64] - SRC[127:64]
DEST[MAX_VL-1:128] (Unmodified)
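The per-element writemask and broadcast behavior of the EVEX forms can be modeled element by element. The sketch below is an assumption-laden illustration, not the processor's implementation: vectors are plain Python lists of floats (Python floats are double precision), `vsubpd_evex` and its parameters are hypothetical names, and rounding-control overrides and floating-point exceptions are not modeled.

```python
def vsubpd_evex(dest, src1, src2, k1=None, zeroing=False, broadcast=False):
    """Element-wise src1 - src2 under an optional writemask.

    k1 is an integer bit mask (bit j selects element j) or None for no mask.
    broadcast=True models EVEX.b = 1 with a memory source: element 0 of src2
    is used for every lane.
    """
    out = []
    for j in range(len(src1)):
        if k1 is None or (k1 >> j) & 1:
            subtrahend = src2[0] if broadcast else src2[j]
            out.append(src1[j] - subtrahend)
        elif zeroing:
            out.append(0.0)        # zeroing-masking
        else:
            out.append(dest[j])    # merging-masking keeps the old element
    return out


old = [9.0, 9.0]
merged = vsubpd_evex(old, [5.0, 7.0], [1.0, 2.0], k1=0b01)
zeroed = vsubpd_evex(old, [5.0, 7.0], [1.0, 2.0], k1=0b01, zeroing=True)
bcast = vsubpd_evex(old, [5.0, 7.0], [1.0, 2.0], broadcast=True)
```

With mask bit 1 clear, the second element either keeps its previous value (merging) or becomes zero (zeroing); with broadcast, both lanes subtract the same low element of the second source.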

Intel C/C++ Compiler Intrinsic Equivalent

VSUBPD __m512d _mm512_sub_pd (__m512d a, __m512d b);

VSUBPD __m512d _mm512_mask_sub_pd (__m512d s, __mmask8 k, __m512d a, __m512d b);

VSUBPD __m512d _mm512_maskz_sub_pd (__mmask8 k, __m512d a, __m512d b);

VSUBPD __m512d _mm512_sub_round_pd (__m512d a, __m512d b, int);

VSUBPD __m512d _mm512_mask_sub_round_pd (__m512d s, __mmask8 k, __m512d a, __m512d b, int);

VSUBPD __m512d _mm512_maskz_sub_round_pd (__mmask8 k, __m512d a, __m512d b, int);

VSUBPD __m256d _mm256_sub_pd (__m256d a, __m256d b);

VSUBPD __m256d _mm256_mask_sub_pd (__m256d s, __mmask8 k, __m256d a, __m256d b);

VSUBPD __m256d _mm256_maskz_sub_pd (__mmask8 k, __m256d a, __m256d b);

SUBPD __m128d _mm_sub_pd (__m128d a, __m128d b);

VSUBPD __m128d _mm_mask_sub_pd (__m128d s, __mmask8 k, __m128d a, __m128d b);

VSUBPD __m128d _mm_maskz_sub_pd (__mmask8 k, __m128d a, __m128d b);

SIMD Floating-Point Exceptions 

Overflow, Underflow, Invalid, Precision, Denormal 

Other Exceptions 

VEX-encoded instructions, see Exceptions Type 2. 

EVEX-encoded instructions, see Exceptions Type E2. 

SUBPS—Subtract Packed Single-Precision Floating-Point Values

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
0F 5C /r | SUBPS xmm1, xmm2/m128 | RM | V/V | SSE | Subtract packed single-precision floating-point values in xmm2/mem from xmm1 and store result in xmm1.
VEX.NDS.128.0F.WIG 5C /r | VSUBPS xmm1, xmm2, xmm3/m128 | RVM | V/V | AVX | Subtract packed single-precision floating-point values in xmm3/mem from xmm2 and store result in xmm1.
VEX.NDS.256.0F.WIG 5C /r | VSUBPS ymm1, ymm2, ymm3/m256 | RVM | V/V | AVX | Subtract packed single-precision floating-point values in ymm3/mem from ymm2 and store result in ymm1.
EVEX.NDS.128.0F.W0 5C /r | VSUBPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | FV | V/V | AVX512VL AVX512F | Subtract packed single-precision floating-point values in xmm3/m128/m32bcst from xmm2 and store result in xmm1 with writemask k1.
EVEX.NDS.256.0F.W0 5C /r | VSUBPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | FV | V/V | AVX512VL AVX512F | Subtract packed single-precision floating-point values in ymm3/m256/m32bcst from ymm2 and store result in ymm1 with writemask k1.
EVEX.NDS.512.0F.W0 5C /r | VSUBPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | FV | V/V | AVX512F | Subtract packed single-precision floating-point values in zmm3/m512/m32bcst from zmm2 and store result in zmm1 with writemask k1.

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
FV | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Performs a SIMD subtract of the packed single-precision floating-point values in the second source operand from
the first source operand, and stores the packed single-precision floating-point results in the destination operand.

VEX.128 and EVEX.128 encoded versions: The second source operand is an XMM register or a 128-bit memory
location. The first source operand and destination operands are XMM registers. Bits (MAX_VL-1:128) of the
corresponding destination register are zeroed.

VEX.256 and EVEX.256 encoded versions: The second source operand is a YMM register or a 256-bit memory
location. The first source operand and destination operands are YMM registers. Bits (MAX_VL-1:256) of the
corresponding destination register are zeroed.

EVEX.512 encoded version: The second source operand is a ZMM register, a 512-bit memory location or a 512-bit
vector broadcasted from a 32-bit memory location. The first source operand and destination operands are ZMM
registers. The destination operand is conditionally updated according to the writemask.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The
destination is not distinct from the first source XMM register and the upper bits (MAX_VL-1:128) of the corresponding
destination register are unmodified.

Operation

VSUBPS (EVEX encoded versions) when src2 operand is a vector register

(KL, VL) = (4, 128), (8, 256), (16, 512)

IF (VL = 512) AND (EVEX.b = 1)
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC1[i+31:i] - SRC2[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI;
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0

VSUBPS (EVEX encoded versions) when src2 operand is a memory source

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN
            IF (EVEX.b = 1)
                THEN DEST[i+31:i] ← SRC1[i+31:i] - SRC2[31:0];
                ELSE DEST[i+31:i] ← SRC1[i+31:i] - SRC2[i+31:i];
            FI;
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI;
    FI;
ENDFOR;
DEST[MAX_VL-1:VL] ← 0

VSUBPS (VEX.256 encoded version)

DEST[31:0] ← SRC1[31:0] - SRC2[31:0]
DEST[63:32] ← SRC1[63:32] - SRC2[63:32]
DEST[95:64] ← SRC1[95:64] - SRC2[95:64]
DEST[127:96] ← SRC1[127:96] - SRC2[127:96]
DEST[159:128] ← SRC1[159:128] - SRC2[159:128]
DEST[191:160] ← SRC1[191:160] - SRC2[191:160]
DEST[223:192] ← SRC1[223:192] - SRC2[223:192]
DEST[255:224] ← SRC1[255:224] - SRC2[255:224]
DEST[MAX_VL-1:256] ← 0

VSUBPS (VEX.128 encoded version)

DEST[31:0] ← SRC1[31:0] - SRC2[31:0]
DEST[63:32] ← SRC1[63:32] - SRC2[63:32]
DEST[95:64] ← SRC1[95:64] - SRC2[95:64]
DEST[127:96] ← SRC1[127:96] - SRC2[127:96]
DEST[MAX_VL-1:128] ← 0

SUBPS (128-bit Legacy SSE version)

DEST[31:0] ← SRC1[31:0] - SRC2[31:0]
DEST[63:32] ← SRC1[63:32] - SRC2[63:32]
DEST[95:64] ← SRC1[95:64] - SRC2[95:64]
DEST[127:96] ← SRC1[127:96] - SRC2[127:96]
DEST[MAX_VL-1:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VSUBPS __m512 _mm512_sub_ps (__m512 a, __m512 b);

VSUBPS __m512 _mm512_mask_sub_ps (__m512 s, __mmask16 k, __m512 a, __m512 b);

VSUBPS __m512 _mm512_maskz_sub_ps (__mmask16 k, __m512 a, __m512 b);

VSUBPS __m512 _mm512_sub_round_ps (__m512 a, __m512 b, int);

VSUBPS __m512 _mm512_mask_sub_round_ps (__m512 s, __mmask16 k, __m512 a, __m512 b, int);

VSUBPS __m512 _mm512_maskz_sub_round_ps (__mmask16 k, __m512 a, __m512 b, int);

VSUBPS __m256 _mm256_sub_ps (__m256 a, __m256 b);

VSUBPS __m256 _mm256_mask_sub_ps (__m256 s, __mmask8 k, __m256 a, __m256 b);

VSUBPS __m256 _mm256_maskz_sub_ps (__mmask8 k, __m256 a, __m256 b);

SUBPS __m128 _mm_sub_ps (__m128 a, __m128 b);

VSUBPS __m128 _mm_mask_sub_ps (__m128 s, __mmask8 k, __m128 a, __m128 b);

VSUBPS __m128 _mm_maskz_sub_ps (__mmask8 k, __m128 a, __m128 b);

SIMD Floating-Point Exceptions 

Overflow, Underflow, Invalid, Precision, Denormal 

Other Exceptions 

VEX-encoded instructions, see Exceptions Type 2. 

EVEX-encoded instructions, see Exceptions Type E2. 

SUBSD—Subtract Scalar Double-Precision Floating-Point Value

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F2 0F 5C /r | SUBSD xmm1, xmm2/m64 | RM | V/V | SSE2 | Subtract the low double-precision floating-point value in xmm2/m64 from xmm1 and store the result in xmm1.
VEX.NDS.128.F2.0F.WIG 5C /r | VSUBSD xmm1, xmm2, xmm3/m64 | RVM | V/V | AVX | Subtract the low double-precision floating-point value in xmm3/m64 from xmm2 and store the result in xmm1.
EVEX.NDS.LIG.F2.0F.W1 5C /r | VSUBSD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | T1S | V/V | AVX512F | Subtract the low double-precision floating-point value in xmm3/m64 from xmm2 and store the result in xmm1 under writemask k1.

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
T1S | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Subtracts the low double-precision floating-point value in the second source operand from the first source operand
and stores the double-precision floating-point result in the low quadword of the destination operand.

The second source operand can be an XMM register or a 64-bit memory location. The first source and destination
operands are XMM registers.

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAX_VL-1:64) of the
corresponding destination register remain unchanged.

VEX.128 and EVEX encoded versions: Bits (127:64) of the XMM register destination are copied from corresponding
bits in the first source operand. Bits (MAX_VL-1:128) of the destination register are zeroed.

EVEX encoded version: The low quadword element of the destination operand is updated according to the
writemask.

Software should ensure VSUBSD is encoded with VEX.L=0. Encoding VSUBSD with VEX.L=1 may encounter
unpredictable behavior across different processor generations.

Operation

VSUBSD (EVEX encoded version)

IF (SRC2 *is register*) AND (EVEX.b = 1)
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← SRC1[63:0] - SRC2[63:0]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAX_VL-1:128] ← 0

VSUBSD (VEX.128 encoded version)

DEST[63:0] ← SRC1[63:0] - SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[MAX_VL-1:128] ← 0

SUBSD (128-bit Legacy SSE version)

DEST[63:0] ← DEST[63:0] - SRC[63:0]
DEST[MAX_VL-1:64] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent 

VSUBSD __m128d _mm_mask_sub_sd (__m128d s, __mmask8 k, __m128d a, __m128d b);

VSUBSD __m128d _mm_maskz_sub_sd (__mmask8 k, __m128d a, __m128d b);

VSUBSD __m128d _mm_sub_round_sd (__m128d a, __m128d b, int);

VSUBSD __m128d _mm_mask_sub_round_sd (__m128d s, __mmask8 k, __m128d a, __m128d b, int);

VSUBSD __m128d _mm_maskz_sub_round_sd (__mmask8 k, __m128d a, __m128d b, int);

SUBSD __m128d _mm_sub_sd (__m128d a, __m128d b);

SIMD Floating-Point Exceptions 

Overflow, Underflow, Invalid, Precision, Denormal 

Other Exceptions 

VEX-encoded instructions, see Exceptions Type 3. 

EVEX-encoded instructions, see Exceptions Type E3. 

SUBSS—Subtract Scalar Single-Precision Floating-Point Value

Opcode | Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F 5C /r | SUBSS xmm1, xmm2/m32 | RM | V/V | SSE | Subtract the low single-precision floating-point value in xmm2/m32 from xmm1 and store the result in xmm1.
VEX.NDS.128.F3.0F.WIG 5C /r | VSUBSS xmm1, xmm2, xmm3/m32 | RVM | V/V | AVX | Subtract the low single-precision floating-point value in xmm3/m32 from xmm2 and store the result in xmm1.
EVEX.NDS.LIG.F3.0F.W0 5C /r | VSUBSS xmm1 {k1}{z}, xmm2, xmm3/m32{er} | T1S | V/V | AVX512F | Subtract the low single-precision floating-point value in xmm3/m32 from xmm2 and store the result in xmm1 under writemask k1.

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
RVM | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
T1S | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description 

Subtracts the low single-precision floating-point value in the second source operand from the first source operand
and stores the single-precision floating-point result in the low doubleword of the destination operand.

The second source operand can be an XMM register or a 32-bit memory location. The first source and destination
operands are XMM registers.

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAX_VL-1:32) of the
corresponding destination register remain unchanged.

VEX.128 and EVEX encoded versions: Bits (127:32) of the XMM register destination are copied from corresponding
bits in the first source operand. Bits (MAX_VL-1:128) of the destination register are zeroed.

EVEX encoded version: The low doubleword element of the destination operand is updated according to the
writemask.

Software should ensure VSUBSS is encoded with VEX.L=0. Encoding VSUBSS with VEX.L=1 may encounter
unpredictable behavior across different processor generations.

Operation

VSUBSS (EVEX encoded version)

IF (SRC2 *is register*) AND (EVEX.b = 1)
    THEN
        SET_RM(EVEX.RC);
    ELSE
        SET_RM(MXCSR.RM);
FI;
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← SRC1[31:0] - SRC2[31:0]
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAX_VL-1:128] ← 0

VSUBSS (VEX.128 encoded version)

DEST[31:0] ← SRC1[31:0] - SRC2[31:0]
DEST[127:32] ← SRC1[127:32]
DEST[MAX_VL-1:128] ← 0

SUBSS (128-bit Legacy SSE version)

DEST[31:0] ← DEST[31:0] - SRC[31:0]
DEST[MAX_VL-1:32] (Unmodified)
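The scalar behavior above, where only the low element is computed and the remaining lanes of the low 128 bits come from the first source, can be sketched as follows. This is an illustrative model under assumptions: a register is a 4-element Python list (lane 0 is the low doubleword), `f32` and `subss` are hypothetical helper names, single precision is emulated by rounding through the standard `struct` module, and masking, rounding control and exceptions are not modeled. Values outside the single-precision range would make `struct.pack` raise `OverflowError`.

```python
import struct


def f32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def subss(src1, src2):
    """Model of the low-lane subtract: lanes 1..3 are copied from src1.

    For the legacy SSE form the destination is the first source, so the same
    function describes it as well.
    """
    low = f32(f32(src1[0]) - f32(src2[0]))
    return [low] + list(src1[1:4])


result = subss([1.5, 2.0, 3.0, 4.0], [0.25, 9.0, 9.0, 9.0])
```

Only lane 0 changes; the upper lanes of the second source never reach the result.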

Intel C/C++ Compiler Intrinsic Equivalent 

VSUBSS __m128 _mm_mask_sub_ss (__m128 s, __mmask8 k, __m128 a, __m128 b);

VSUBSS __m128 _mm_maskz_sub_ss (__mmask8 k, __m128 a, __m128 b);

VSUBSS __m128 _mm_sub_round_ss (__m128 a, __m128 b, int);

VSUBSS __m128 _mm_mask_sub_round_ss (__m128 s, __mmask8 k, __m128 a, __m128 b, int);

VSUBSS __m128 _mm_maskz_sub_round_ss (__mmask8 k, __m128 a, __m128 b, int);

SUBSS __m128 _mm_sub_ss (__m128 a, __m128 b);

SIMD Floating-Point Exceptions 

Overflow, Underflow, Invalid, Precision, Denormal 

Other Exceptions 

VEX-encoded instructions, see Exceptions Type 3. 

EVEX-encoded instructions, see Exceptions Type E3. 

SWAPGS—Swap GS Base Register

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
0F 01 F8 | SWAPGS | NP | Valid | Invalid | Exchanges the current GS base register value with the value contained in MSR address C0000102H.

Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
NP | NA | NA | NA | NA


Description 

SWAPGS exchanges the current GS base register value with the value contained in MSR address C0000102H
(IA32_KERNEL_GS_BASE). The SWAPGS instruction is a privileged instruction intended for use by system
software.

When using SYSCALL to implement system calls, there is no kernel stack at the OS entry point. Neither is there a 
straightforward method to obtain a pointer to kernel structures from which the kernel stack pointer could be read. 
Thus, the kernel cannot save general purpose registers or reference memory. 

By design, SWAPGS does not require any general purpose registers or memory operands. No registers need to be 
saved before using the instruction. SWAPGS exchanges the CPL 0 data pointer from the IA32_KERNEL_GS_BASE 
MSR with the GS base register. The kernel can then use the GS prefix on normal memory references to access 
kernel data structures. Similarly, when the OS kernel is entered using an interrupt or exception (where the kernel 
stack is already set up), SWAPGS can be used to quickly get a pointer to the kernel data structures. 

The IA32_KERNEL_GS_BASE MSR itself is only accessible using RDMSR/WRMSR instructions. Those instructions 
are only accessible at privilege level 0. The WRMSR instruction ensures that the IA32_KERNEL_GS_BASE MSR 
contains a canonical address. 
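
The exchange described above can be sketched as a small software model. This is only an illustration: the `CpuState` class, its field names, and the two exception classes are hypothetical names chosen for this sketch and are not part of any real API.

```python
from dataclasses import dataclass


class UndefinedOpcode(Exception):
    """Stands in for #UD in this sketch."""


class GeneralProtection(Exception):
    """Stands in for #GP(0) in this sketch."""


@dataclass
class CpuState:
    gs_base: int              # current GS base register value
    kernel_gs_base: int       # IA32_KERNEL_GS_BASE (MSR C0000102H)
    cpl: int = 0              # current privilege level
    in_64bit_mode: bool = True


def swapgs(cpu: CpuState) -> None:
    """Exchange GS.base with IA32_KERNEL_GS_BASE, checking mode and CPL first."""
    if not cpu.in_64bit_mode:
        raise UndefinedOpcode("SWAPGS is valid only in 64-bit mode")
    if cpu.cpl != 0:
        raise GeneralProtection("SWAPGS requires CPL 0")
    # No general-purpose register or memory operand is involved in the swap.
    cpu.gs_base, cpu.kernel_gs_base = cpu.kernel_gs_base, cpu.gs_base
```

Executing the model twice restores the original pair of values, which mirrors how a kernel would typically issue SWAPGS once on entry and once again before returning to user code.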

Operation 

IF CS.L ≠ 1 (* Not in 64-Bit Mode *) 
    THEN 
        #UD; FI; 
IF CPL ≠ 0 
    THEN #GP(0); FI; 
tmp ← GS.base; 
GS.base ← IA32_KERNEL_GS_BASE; 
IA32_KERNEL_GS_BASE ← tmp; 

Flags Affected 

None 

Protected Mode Exceptions 

#UD If Mode ≠ 64-Bit. 

Real-Address Mode Exceptions 

#UD If Mode ≠ 64-Bit. 

Virtual-8086 Mode Exceptions 

#UD If Mode ≠ 64-Bit. 

Compatibility Mode Exceptions 

#UD If Mode ≠ 64-Bit. 

64-Bit Mode Exceptions 

#GP(0) If CPL ≠ 0. 

If the LOCK prefix is used. 




SYSCALL—Fast System Call 


Opcode   Instruction   Op/En   64-Bit Mode   Compat/Leg Mode   Description
0F 05    SYSCALL       NP      Valid         Invalid           Fast call to privilege level 0 system procedures.


Instruction Operand Encoding 

Op/En   Operand 1   Operand 2   Operand 3   Operand 4
NP      NA          NA          NA          NA


Description 

SYSCALL invokes an OS system-call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR 
MSR (after saving the address of the instruction following SYSCALL into RCX). (The WRMSR instruction ensures 
that the IA32_LSTAR MSR always contains a canonical address.) 

SYSCALL also saves RFLAGS into R11 and then masks RFLAGS using the IA32_FMASK MSR (MSR address 
C0000084H); specifically, the processor clears in RFLAGS every bit corresponding to a bit that is set in the 
IA32_FMASK MSR. 

SYSCALL loads the CS and SS selectors with values derived from bits 47:32 of the IA32_STAR MSR. However, the 
CS and SS descriptor caches are not loaded from the descriptors (in GDT or LDT) referenced by those selectors. 
Instead, the descriptor caches are loaded with fixed values. See the Operation section for details. It is the 
responsibility of OS software to ensure that the descriptors (in GDT or LDT) referenced by those selector values 
correspond to the fixed values loaded into the descriptor caches; the SYSCALL instruction does not ensure this 
correspondence. 

The SYSCALL instruction does not save the stack pointer (RSP). If the OS system-call handler will change the stack 
pointer, it is the responsibility of software to save the previous value of the stack pointer. This might be done prior 
to executing SYSCALL, with software restoring the stack pointer with the instruction following SYSCALL (which will 
be executed after SYSRET). Alternatively, the OS system-call handler may save the stack pointer and restore it 
before executing SYSRET. 
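
The register and selector updates described above can be sketched in software as follows. This is a minimal model under stated assumptions: the function name `syscall_entry` and the dictionary keys are hypothetical, the descriptor-cache fields are omitted, and the selectors are truncated to 16 bits for convenience.

```python
MASK64 = (1 << 64) - 1


def syscall_entry(next_rip, rflags, ia32_star, ia32_lstar, ia32_fmask):
    """Return the register values SYSCALL would leave behind (software model only)."""
    star_47_32 = (ia32_star >> 32) & 0xFFFF
    return {
        "RCX": next_rip,                          # address of the instruction after SYSCALL
        "R11": rflags,                            # RFLAGS saved before masking
        "RIP": ia32_lstar,                        # handler entry point
        "RFLAGS": rflags & ~ia32_fmask & MASK64,  # clear every bit that is set in IA32_FMASK
        "CS.Selector": star_47_32 & 0xFFFC,       # RPL forced to 0
        "SS.Selector": (star_47_32 + 8) & 0xFFFF, # SS just above CS
    }
```

For example, with IA32_STAR[47:32] = 0010H and IA32_FMASK = 200H (the IF bit), an incoming RFLAGS value of 246H is saved unchanged in R11 while RFLAGS itself becomes 046H, and the selectors become 0010H for CS and 0018H for SS.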

Operation 

IF (CS.L ≠ 1) or (IA32_EFER.LMA ≠ 1) or (IA32_EFER.SCE ≠ 1) 
(* Not in 64-Bit Mode or SYSCALL/SYSRET not enabled in IA32_EFER *) 
    THEN #UD; 
FI; 
RCX ← RIP; (* Will contain address of next instruction *) 
RIP ← IA32_LSTAR; 
R11 ← RFLAGS; 
RFLAGS ← RFLAGS AND NOT(IA32_FMASK); 
CS.Selector ← IA32_STAR[47:32] AND FFFCH (* Operating system provides CS; RPL forced to 0 *) 
(* Set rest of CS to a fixed value *) 
CS.Base ← 0; (* Flat segment *) 
CS.Limit ← FFFFFH; (* With 4-KByte granularity, implies a 4-GByte limit *) 
CS.Type ← 11; (* Execute/read code, accessed *) 
CS.S ← 1; 
CS.DPL ← 0; 
CS.P ← 1; 
CS.L ← 1; (* Entry is to 64-bit mode *) 
CS.D ← 0; (* Required if CS.L = 1 *) 
CS.G ← 1; (* 4-KByte granularity *) 
CPL ← 0; 
SS.Selector ← IA32_STAR[47:32] + 8; (* SS just above CS *) 
(* Set rest of SS to a fixed value *) 
SS.Base ← 0; (* Flat segment *) 
SS.Limit ← FFFFFH; (* With 4-KByte granularity, implies a 4-GByte limit *) 
SS.Type ← 3; (* Read/write data, accessed *) 
SS.S ← 1; 
SS.DPL ← 0; 
SS.P ← 1; 
SS.B ← 1; (* 32-bit stack segment *) 
SS.G ← 1; (* 4-KByte granularity *) 

Flags Affected 

All. 

Protected Mode Exceptions 

#UD The SYSCALL instruction is not recognized in protected mode. 

Real-Address Mode Exceptions 

#UD The SYSCALL instruction is not recognized in real-address mode. 

Virtual-8086 Mode Exceptions 

#UD The SYSCALL instruction is not recognized in virtual-8086 mode. 

Compatibility Mode Exceptions 

#UD The SYSCALL instruction is not recognized in compatibility mode. 

64-Bit Mode Exceptions 

#UD If IA32_EFER.SCE = 0. 

If the LOCK prefix is used. 

SYSENTER—Fast System Call 


Opcode   Instruction   Op/En   64-Bit Mode   Compat/Leg Mode   Description
0F 34    SYSENTER      NP      Valid         Valid             Fast call to privilege level 0 system procedures.


Instruction Operand Encoding 

Op/En   Operand 1   Operand 2   Operand 3   Operand 4
NP      NA          NA          NA          NA


Description 

Executes a fast call to a level 0 system procedure or routine. SYSENTER is a companion instruction to SYSEXIT. The 
instruction is optimized to provide the maximum performance for system calls from user code running at privilege 
level 3 to operating system or executive procedures running at privilege level 0. 

When executed in IA-32e mode, the SYSENTER instruction transitions the logical processor to 64-bit mode; 
otherwise, the logical processor remains in protected mode. 

Prior to executing the SYSENTER instruction, software must specify the privilege level 0 code segment and code 
entry point, and the privilege level 0 stack segment and stack pointer by writing values to the following MSRs: 

• IA32_SYSENTER_CS (MSR address 174H) — The lower 16 bits of this MSR are the segment selector for the 
privilege level 0 code segment. This value is also used to determine the segment selector of the privilege level 
0 stack segment (see the Operation section). This value cannot indicate a null selector. 

• IA32_SYSENTER_EIP (MSR address 176H) — The value of this MSR is loaded into RIP (thus, this value 
references the first instruction of the selected operating procedure or routine). In protected mode, only 
bits 31:0 are loaded. 

• IA32_SYSENTER_ESP (MSR address 175H) — The value of this MSR is loaded into RSP (thus, this value 
contains the stack pointer for the privilege level 0 stack). This value cannot represent a non-canonical address. 
In protected mode, only bits 31:0 are loaded. 

These MSRs can be read from and written to using RDMSR/WRMSR. The WRMSR instruction ensures that the 
IA32_SYSENTER_EIP and IA32_SYSENTER_ESP MSRs always contain canonical addresses. 

While SYSENTER loads the CS and SS selectors with values derived from the IA32_SYSENTER_CS MSR, the CS and 
SS descriptor caches are not loaded from the descriptors (in GDT or LDT) referenced by those selectors. Instead, 
the descriptor caches are loaded with fixed values. See the Operation section for details. It is the responsibility of 
OS software to ensure that the descriptors (in GDT or LDT) referenced by those selector values correspond to the 
fixed values loaded into the descriptor caches; the SYSENTER instruction does not ensure this correspondence. 

The SYSENTER instruction can be invoked from all operating modes except real-address mode. 

The SYSENTER and SYSEXIT instructions are companion instructions, but they do not constitute a call/return pair. 
When executing a SYSENTER instruction, the processor does not save state information for the user code (e.g., the 
instruction pointer), and neither the SYSENTER nor the SYSEXIT instruction supports passing parameters on the 
stack. 

To use the SYSENTER and SYSEXIT instructions as companion instructions for transitions between privilege level 3 
code and privilege level 0 operating system procedures, the following conventions must be followed: 

• The segment descriptors for the privilege level 0 code and stack segments and for the privilege level 3 code and 
stack segments must be contiguous in a descriptor table. This convention allows the processor to compute the 
segment selectors from the value entered in the IA32_SYSENTER_CS MSR. 

• The fast system call "stub" routines executed by user code (typically in shared libraries or DLLs) must save the 
required return IP and processor state information if a return to the calling procedure is required. Likewise, the 
operating system or executive procedures called with SYSENTER instructions must have access to and use this 
saved return and state information when returning to the user code. 

The SYSENTER and SYSEXIT instructions were introduced into the IA-32 architecture in the Pentium II processor. 
The availability of these instructions on a processor is indicated with the SYSENTER/SYSEXIT present (SEP) feature 
flag returned to the EDX register by the CPUID instruction. An operating system that qualifies the SEP flag must 
also qualify the processor family and model to ensure that the SYSENTER/SYSEXIT instructions are actually 
present. For example: 

IF CPUID SEP bit is set 
    THEN IF (Family = 6) and (Model < 3) and (Stepping < 3) 
        THEN 
            SYSENTER/SYSEXIT_Not_Supported; FI; 
        ELSE 
            SYSENTER/SYSEXIT_Supported; FI; 
FI; 

When the CPUID instruction is executed on the Pentium Pro processor (model 1), the processor returns the SEP 
flag as set, but does not support the SYSENTER/SYSEXIT instructions. 
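
The qualification check above translates directly into code. The sketch below is illustrative only: the function names are hypothetical, and the caller is assumed to have already obtained the EDX value from CPUID leaf 01H (where SEP is bit 11) and decoded the family, model, and stepping fields.

```python
SEP_BIT = 11  # position of the SEP feature flag in EDX for CPUID leaf 01H


def sep_flag_from_edx(edx: int) -> bool:
    """Extract the SEP feature flag from the EDX value returned by CPUID.01H."""
    return bool((edx >> SEP_BIT) & 1)


def sysenter_supported(sep_flag: bool, family: int, model: int, stepping: int) -> bool:
    """Apply the family/model/stepping qualification shown in the pseudocode above."""
    if not sep_flag:
        return False
    # Early family-6 parts (such as the Pentium Pro) report SEP as set
    # even though SYSENTER/SYSEXIT are not implemented.
    if family == 6 and model < 3 and stepping < 3:
        return False
    return True
```

A Pentium Pro (family 6, model 1) with a stepping below 3 is therefore reported as unsupported even though its SEP flag is set.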


Operation 

IF CR0.PE = 0 OR IA32_SYSENTER_CS[15:2] = 0 THEN #GP(0); FI; 
RFLAGS.VM ← 0; (* Ensures protected mode execution *) 
RFLAGS.IF ← 0; (* Mask interrupts *) 
IF in IA-32e mode 
    THEN 
        RSP ← IA32_SYSENTER_ESP; 
        RIP ← IA32_SYSENTER_EIP; 
    ELSE 
        ESP ← IA32_SYSENTER_ESP[31:0]; 
        EIP ← IA32_SYSENTER_EIP[31:0]; 
FI; 
CS.Selector ← IA32_SYSENTER_CS[15:0] AND FFFCH; (* Operating system provides CS; RPL forced to 0 *) 
(* Set rest of CS to a fixed value *) 
CS.Base ← 0; (* Flat segment *) 
CS.Limit ← FFFFFH; (* With 4-KByte granularity, implies a 4-GByte limit *) 
CS.Type ← 11; (* Execute/read code, accessed *) 
CS.S ← 1; 
CS.DPL ← 0; 
CS.P ← 1; 
IF in IA-32e mode 
    THEN 
        CS.L ← 1; (* Entry is to 64-bit mode *) 
        CS.D ← 0; (* Required if CS.L = 1 *) 
    ELSE 
        CS.L ← 0; 
        CS.D ← 1; (* 32-bit code segment *) 
FI; 
CS.G ← 1; (* 4-KByte granularity *) 
CPL ← 0; 
SS.Selector ← CS.Selector + 8; (* SS just above CS *) 
(* Set rest of SS to a fixed value *) 
SS.Base ← 0; (* Flat segment *) 
SS.Limit ← FFFFFH; (* With 4-KByte granularity, implies a 4-GByte limit *) 
SS.Type ← 3; (* Read/write data, accessed *) 
SS.S ← 1; 
SS.DPL ← 0; 
SS.P ← 1; 
SS.B ← 1; (* 32-bit stack segment *) 
SS.G ← 1; (* 4-KByte granularity *) 

Flags Affected 

VM, IF (see Operation above) 

Protected Mode Exceptions 

#GP(0) If IA32_SYSENTER_CS[15:2] = 0. 

#UD If the LOCK prefix is used. 

Real-Address Mode Exceptions 

#GP The SYSENTER instruction is not recognized in real-address mode. 

#UD If the LOCK prefix is used. 

Virtual-8086 Mode Exceptions 

Same exceptions as in protected mode. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions 

Same exceptions as in protected mode. 

SYSEXIT—Fast Return from Fast System Call 

Opcode          Instruction   Op/En   64-Bit Mode   Compat/Leg Mode   Description
0F 35           SYSEXIT       NP      Valid         Valid             Fast return to privilege level 3 user code.
REX.W + 0F 35   SYSEXIT       NP      Valid         Valid             Fast return to 64-bit mode privilege level 3 user code.


Instruction Operand Encoding 

Op/En   Operand 1   Operand 2   Operand 3   Operand 4
NP      NA          NA          NA          NA


Description 

Executes a fast return to privilege level 3 user code. SYSEXIT is a companion instruction to the SYSENTER 
instruction. The instruction is optimized to provide the maximum performance for returns from system procedures 
executing at protection level 0 to user procedures executing at protection level 3. It must be executed from code 
executing at privilege level 0. 

With a 64-bit operand size, SYSEXIT remains in 64-bit mode; otherwise, it either enters compatibility mode (if the 
logical processor is in IA-32e mode) or remains in protected mode (if it is not). 

Prior to executing SYSEXIT, software must specify the privilege level 3 code segment and code entry point, and the 
privilege level 3 stack segment and stack pointer by writing values into the following MSR and general-purpose 
registers: 

• IA32_SYSENTER_CS (MSR address 174H) — Contains a 32-bit value that is used to determine the segment 
selectors for the privilege level 3 code and stack segments (see the Operation section) 

• RDX — The canonical address in this register is loaded into RIP (thus, this value references the first instruction 
to be executed in the user code). If the return is not to 64-bit mode, only bits 31:0 are loaded. 

• RCX — The canonical address in this register is loaded into RSP (thus, this value contains the stack pointer for 
the privilege level 3 stack). If the return is not to 64-bit mode, only bits 31:0 are loaded. 

The IA32_SYSENTER_CS MSR can be read from and written to using RDMSR and WRMSR. 

While SYSEXIT loads the CS and SS selectors with values derived from the IA32_SYSENTER_CS MSR, the CS and 
SS descriptor caches are not loaded from the descriptors (in GDT or LDT) referenced by those selectors. Instead, 
the descriptor caches are loaded with fixed values. See the Operation section for details. It is the responsibility of 
OS software to ensure that the descriptors (in GDT or LDT) referenced by those selector values correspond to the 
fixed values loaded into the descriptor caches; the SYSEXIT instruction does not ensure this correspondence. 
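
The selector derivation detailed in the Operation section can be sketched as follows. This is a software illustration only: the function name `sysexit_selectors` is hypothetical, a `ValueError` stands in for #GP(0), and the results are truncated to 16 bits for convenience.

```python
def sysexit_selectors(ia32_sysenter_cs: int, operand_size_64: bool):
    """Return (CS.Selector, SS.Selector) as SYSEXIT would derive them."""
    base = ia32_sysenter_cs & 0xFFFF
    if (base >> 2) == 0:
        # IA32_SYSENTER_CS[15:2] = 0 raises #GP(0) on real hardware.
        raise ValueError("IA32_SYSENTER_CS[15:2] is zero")
    offset = 32 if operand_size_64 else 16
    cs = ((base + offset) | 3) & 0xFFFF   # RPL forced to 3
    ss = (cs + 8) & 0xFFFF                # SS just above CS
    return cs, ss
```

With IA32_SYSENTER_CS = 0008H, a return with 32-bit operand size yields CS = 001BH and SS = 0023H, while a return with 64-bit operand size yields CS = 002BH and SS = 0033H. This is why the privilege level 0 and privilege level 3 descriptors must be laid out contiguously in the descriptor table.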

The SYSEXIT instruction can be invoked from all operating modes except real-address mode and virtual-8086 
mode. 

The SYSENTER and SYSEXIT instructions were introduced into the IA-32 architecture in the Pentium II processor. 
The availability of these instructions on a processor is indicated with the SYSENTER/SYSEXIT present (SEP) feature 
flag returned to the EDX register by the CPUID instruction. An operating system that qualifies the SEP flag must 
also qualify the processor family and model to ensure that the SYSENTER/SYSEXIT instructions are actually 
present. For example: 

IF CPUID SEP bit is set 

THEN IF (Family = 6) and (Model < 3) and (Stepping < 3) 

THEN 

SYSENTER/SYSEXIT_Not_Supported;FI; 

ELSE 

SYSENTER/SYSEXIT_Supported; FI; 

FI; 

When the CPUID instruction is executed on the Pentium Pro processor (model 1), the processor returns the SEP 
flag as set, but does not support the SYSENTER/SYSEXIT instructions. 

Operation 

IF IA32_SYSENTER_CS[15:2] = 0 OR CR0.PE = 0 OR CPL ≠ 0 THEN #GP(0); FI; 
IF operand size is 64-bit 
    THEN (* Return to 64-bit mode *) 
        RSP ← RCX; 
        RIP ← RDX; 
    ELSE (* Return to protected mode or compatibility mode *) 
        RSP ← ECX; 
        RIP ← EDX; 
FI; 
IF operand size is 64-bit (* Operating system provides CS; RPL forced to 3 *) 
    THEN CS.Selector ← IA32_SYSENTER_CS[15:0] + 32; 
    ELSE CS.Selector ← IA32_SYSENTER_CS[15:0] + 16; 
FI; 
CS.Selector ← CS.Selector OR 3; (* RPL forced to 3 *) 
(* Set rest of CS to a fixed value *) 
CS.Base ← 0; (* Flat segment *) 
CS.Limit ← FFFFFH; (* With 4-KByte granularity, implies a 4-GByte limit *) 
CS.Type ← 11; (* Execute/read code, accessed *) 
CS.S ← 1; 
CS.DPL ← 3; 
CS.P ← 1; 
IF operand size is 64-bit 
    THEN (* return to 64-bit mode *) 
        CS.L ← 1; (* 64-bit code segment *) 
        CS.D ← 0; (* Required if CS.L = 1 *) 
    ELSE (* return to protected mode or compatibility mode *) 
        CS.L ← 0; 
        CS.D ← 1; (* 32-bit code segment *) 
FI; 
CS.G ← 1; (* 4-KByte granularity *) 
CPL ← 3; 
SS.Selector ← CS.Selector + 8; (* SS just above CS *) 
(* Set rest of SS to a fixed value *) 
SS.Base ← 0; (* Flat segment *) 
SS.Limit ← FFFFFH; (* With 4-KByte granularity, implies a 4-GByte limit *) 
SS.Type ← 3; (* Read/write data, accessed *) 
SS.S ← 1; 
SS.DPL ← 3; 
SS.P ← 1; 
SS.B ← 1; (* 32-bit stack segment *) 
SS.G ← 1; (* 4-KByte granularity *) 


Flags Affected 

None. 


Protected Mode Exceptions 

#GP(0) If IA32_SYSENTER_CS[15:2] = 0. 

If CPL ≠ 0. 

#UD If the LOCK prefix is used. 

Real-Address Mode Exceptions 

#GP The SYSEXIT instruction is not recognized in real-address mode. 

#UD If the LOCK prefix is used. 

Virtual-8086 Mode Exceptions 

#GP(0) The SYSEXIT instruction is not recognized in virtual-8086 mode. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions 

#GP(0) If IA32_SYSENTER_CS = 0. 

If CPL ≠ 0. 

If RCX or RDX contains a non-canonical address. 

#UD If the LOCK prefix is used. 

SYSRET—Return From Fast System Call 

Opcode          Instruction   Op/En   64-Bit Mode   Compat/Leg Mode   Description
0F 07           SYSRET        NP      Valid         Invalid           Return to compatibility mode from fast system call
REX.W + 0F 07   SYSRET        NP      Valid         Invalid           Return to 64-bit mode from fast system call


Instruction Operand Encoding 

Op/En   Operand 1   Operand 2   Operand 3   Operand 4
NP      NA          NA          NA          NA


Description 

SYSRET is a companion instruction to the SYSCALL instruction. It returns from an OS system-call handler to user 
code at privilege level 3. It does so by loading RIP from RCX and loading RFLAGS from R11.¹ With a 64-bit operand 
size, SYSRET remains in 64-bit mode; otherwise, it enters compatibility mode and only the low 32 bits of the 
registers are loaded. 

SYSRET loads the CS and SS selectors with values derived from bits 63:48 of the IA32_STAR MSR. However, the 
CS and SS descriptor caches are not loaded from the descriptors (in GDT or LDT) referenced by those selectors. 
Instead, the descriptor caches are loaded with fixed values. See the Operation section for details. It is the 
responsibility of OS software to ensure that the descriptors (in GDT or LDT) referenced by those selector values 
correspond to the fixed values loaded into the descriptor caches; the SYSRET instruction does not ensure this 
correspondence. 

The SYSRET instruction does not modify the stack pointer (ESP or RSP). For that reason, it is necessary for software 
to switch to the user stack. The OS may load the user stack pointer (if it was saved after SYSCALL) before executing 
SYSRET; alternatively, user code may load the stack pointer (if it was saved before SYSCALL) after receiving control 
from SYSRET. 

If the OS loads the stack pointer before executing SYSRET, it must ensure that the handler of any interrupt or 
exception delivered between restoring the stack pointer and successful execution of SYSRET is not invoked with the 
user stack. It can do so using approaches such as the following: 

• External interrupts. The OS can prevent an external interrupt from being delivered by clearing EFLAGS.IF 
before loading the user stack pointer. 

• Nonmaskable interrupts (NMIs). The OS can ensure that the NMI handler is invoked with the correct stack by 
using the interrupt stack table (IST) mechanism for gate 2 (NMI) in the IDT (see Section 6.14.5, "Interrupt 
Stack Table," in Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A). 

• General-protection exceptions (#GP). The SYSRET instruction generates #GP(0) if the value of RCX is not 
canonical. The OS can address this possibility using one or more of the following approaches: 

— Confirming that the value of RCX is canonical before executing SYSRET. 

— Using paging to ensure that the SYSCALL instruction will never save a non-canonical value into RCX. 

— Using the IST mechanism for gate 13 (#GP) in the IDT. 
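
The first approach, confirming that RCX is canonical before executing SYSRET, can be sketched as below, together with the RFLAGS filtering that SYSRET applies to R11. The function names are hypothetical, and the 48-bit linear-address width is an assumption of this sketch; processors with wider linear addresses would need a different `linear_bits` value.

```python
MASK64 = (1 << 64) - 1


def is_canonical(addr: int, linear_bits: int = 48) -> bool:
    """True if bits 63 down to (linear_bits - 1) of addr are all equal."""
    addr &= MASK64
    top = addr >> (linear_bits - 1)                 # bit (linear_bits - 1) and everything above it
    all_ones = (1 << (64 - linear_bits + 1)) - 1
    return top == 0 or top == all_ones


def sysret_rflags(r11: int) -> int:
    """RFLAGS value loaded by SYSRET: (R11 AND 3C7FD7H) OR 2."""
    return (r11 & 0x3C7FD7) | 2
```

A handler following this approach would test `is_canonical` on the saved return address and take a slower return path (or terminate the task) when the check fails, rather than letting SYSRET fault at privilege level 0 on the user stack.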

Operation 

IF (CS.L ≠ 1) or (IA32_EFER.LMA ≠ 1) or (IA32_EFER.SCE ≠ 1) 
(* Not in 64-Bit Mode or SYSCALL/SYSRET not enabled in IA32_EFER *) 
    THEN #UD; FI; 
IF (CPL ≠ 0) OR (RCX is not canonical) THEN #GP(0); FI; 
IF (operand size is 64-bit) 
    THEN (* Return to 64-Bit Mode *) 
        RIP ← RCX; 
    ELSE (* Return to Compatibility Mode *) 
        RIP ← ECX; 
FI; 
RFLAGS ← (R11 & 3C7FD7H) | 2; (* Clear RF, VM, reserved bits; set bit 1 *) 
IF (operand size is 64-bit) 
    THEN CS.Selector ← IA32_STAR[63:48]+16; 
    ELSE CS.Selector ← IA32_STAR[63:48]; 
FI; 
CS.Selector ← CS.Selector OR 3; (* RPL forced to 3 *) 
(* Set rest of CS to a fixed value *) 
CS.Base ← 0; (* Flat segment *) 
CS.Limit ← FFFFFH; (* With 4-KByte granularity, implies a 4-GByte limit *) 
CS.Type ← 11; (* Execute/read code, accessed *) 
CS.S ← 1; 
CS.DPL ← 3; 
CS.P ← 1; 
IF (operand size is 64-bit) 
    THEN (* Return to 64-Bit Mode *) 
        CS.L ← 1; (* 64-bit code segment *) 
        CS.D ← 0; (* Required if CS.L = 1 *) 
    ELSE (* Return to Compatibility Mode *) 
        CS.L ← 0; (* Compatibility mode *) 
        CS.D ← 1; (* 32-bit code segment *) 
FI; 
CS.G ← 1; (* 4-KByte granularity *) 
CPL ← 3; 
SS.Selector ← (IA32_STAR[63:48]+8) OR 3; (* RPL forced to 3 *) 
(* Set rest of SS to a fixed value *) 
SS.Base ← 0; (* Flat segment *) 
SS.Limit ← FFFFFH; (* With 4-KByte granularity, implies a 4-GByte limit *) 
SS.Type ← 3; (* Read/write data, accessed *) 
SS.S ← 1; 
SS.DPL ← 3; 
SS.P ← 1; 
SS.B ← 1; (* 32-bit stack segment *) 
SS.G ← 1; (* 4-KByte granularity *) 

1. Regardless of the value of R11, the RF and VM flags are always 0 in RFLAGS after execution of SYSRET. In addition, all reserved bits 
in RFLAGS retain the fixed values. 


Flags Affected 

All. 

Protected Mode Exceptions 

#UD The SYSRET instruction is not recognized in protected mode. 

Real-Address Mode Exceptions 

#UD The SYSRET instruction is not recognized in real-address mode. 

Virtual-8086 Mode Exceptions 

#UD The SYSRET instruction is not recognized in virtual-8086 mode. 

Compatibility Mode Exceptions 

#UD The SYSRET instruction is not recognized in compatibility mode. 

64-Bit Mode Exceptions 

#UD If IA32_EFER.SCE = 0. 

If the LOCK prefix is used. 

#GP(0) If CPL ≠ 0. 

If RCX contains a non-canonical address. 

TEST—Logical Compare 

Opcode             Instruction          Op/En   64-Bit Mode   Compat/Leg Mode   Description
A8 ib              TEST AL, imm8        I       Valid         Valid             AND imm8 with AL; set SF, ZF, PF according to result.
A9 iw              TEST AX, imm16       I       Valid         Valid             AND imm16 with AX; set SF, ZF, PF according to result.
A9 id              TEST EAX, imm32      I       Valid         Valid             AND imm32 with EAX; set SF, ZF, PF according to result.
REX.W + A9 id      TEST RAX, imm32      I       Valid         N.E.              AND imm32 sign-extended to 64-bits with RAX; set SF, ZF, PF according to result.
F6 /0 ib           TEST r/m8, imm8      MI      Valid         Valid             AND imm8 with r/m8; set SF, ZF, PF according to result.
REX + F6 /0 ib     TEST r/m8*, imm8     MI      Valid         N.E.              AND imm8 with r/m8; set SF, ZF, PF according to result.
F7 /0 iw           TEST r/m16, imm16    MI      Valid         Valid             AND imm16 with r/m16; set SF, ZF, PF according to result.
F7 /0 id           TEST r/m32, imm32    MI      Valid         Valid             AND imm32 with r/m32; set SF, ZF, PF according to result.
REX.W + F7 /0 id   TEST r/m64, imm32    MI      Valid         N.E.              AND imm32 sign-extended to 64-bits with r/m64; set SF, ZF, PF according to result.
84 /r              TEST r/m8, r8        MR      Valid         Valid             AND r8 with r/m8; set SF, ZF, PF according to result.
REX + 84 /r        TEST r/m8*, r8*      MR      Valid         N.E.              AND r8 with r/m8; set SF, ZF, PF according to result.
85 /r              TEST r/m16, r16      MR      Valid         Valid             AND r16 with r/m16; set SF, ZF, PF according to result.
85 /r              TEST r/m32, r32      MR      Valid         Valid             AND r32 with r/m32; set SF, ZF, PF according to result.
REX.W + 85 /r      TEST r/m64, r64      MR      Valid         N.E.              AND r64 with r/m64; set SF, ZF, PF according to result.


NOTES: 

* In 64-bit mode, r/m8 cannot be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH. 


Instruction Operand Encoding 

Op/En   Operand 1       Operand 2       Operand 3   Operand 4
I       AL/AX/EAX/RAX   imm8/16/32      NA          NA
MI      ModRM:r/m (r)   imm8/16/32      NA          NA
MR      ModRM:r/m (r)   ModRM:reg (r)   NA          NA


Description 

Computes the bit-wise logical AND of first operand (source 1 operand) and the second operand (source 2 operand) 
and sets the SF, ZF, and PF status flags according to the result. The result is then discarded. 

In 64-bit mode, using a REX prefix in the form of REX.R permits access to additional registers (R8-R15). Using a 
REX prefix in the form of REX.W promotes operation to 64 bits. See the summary chart at the beginning of this 
section for encoding data and limits. 

Operation 

TEMP ← SRC1 AND SRC2; 
SF ← MSB(TEMP); 
IF TEMP = 0 
    THEN ZF ← 1; 
    ELSE ZF ← 0; 
FI; 
PF ← BitwiseXNOR(TEMP[0:7]); 
CF ← 0; 
OF ← 0; 
(* AF is undefined *) 

Flags Affected 

The OF and CF flags are set to 0. The SF, ZF, and PF flags are set according to the result (see the "Operation" section 
above). The state of the AF flag is undefined. 
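
The flag computation can be illustrated with a short software model. This is a sketch only: the function name `flags_after_test` and the dictionary layout are hypothetical, and AF is omitted because its state is undefined.

```python
def flags_after_test(src1: int, src2: int, width: int = 32) -> dict:
    """Model the status flags produced by TEST for operands of the given bit width."""
    mask = (1 << width) - 1
    temp = src1 & src2 & mask                  # the AND result itself is discarded by TEST
    low_byte_ones = bin(temp & 0xFF).count("1")
    return {
        "SF": (temp >> (width - 1)) & 1,       # most significant bit of the result
        "ZF": int(temp == 0),
        "PF": int(low_byte_ones % 2 == 0),     # set when the low byte has even parity
        "CF": 0,
        "OF": 0,
    }
```

For instance, `flags_after_test(0x0F, 0xF0, width=8)` produces a zero result, so ZF and PF are 1 and SF is 0. Only the low eight bits of the result contribute to PF, whatever the operand size.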

Protected Mode Exceptions 

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

If the DS, ES, FS, or GS register contains a NULL segment selector. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3. 

#UD If the LOCK prefix is used. 

Real-Address Mode Exceptions 

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#SS If a memory operand effective address is outside the SS segment limit. 

#UD If the LOCK prefix is used. 

Virtual-8086 Mode Exceptions 

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. 

#SS(0) If a memory operand effective address is outside the SS segment limit. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made. 

#UD If the LOCK prefix is used. 

Compatibility Mode Exceptions 

Same exceptions as in protected mode. 

64-Bit Mode Exceptions 

#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 

#GP(0) If the memory address is in a non-canonical form. 

#PF(fault-code) If a page fault occurs. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3. 

#UD If the LOCK prefix is used. 

TZCNT — Count the Number of Trailing Zero Bits 

Opcode/Instruction                       Op/En   64/32-bit Mode   CPUID Feature Flag   Description
F3 0F BC /r TZCNT r16, r/m16             RM      V/V              BMI1                 Count the number of trailing zero bits in r/m16, return result in r16.
F3 0F BC /r TZCNT r32, r/m32             RM      V/V              BMI1                 Count the number of trailing zero bits in r/m32, return result in r32.
F3 REX.W 0F BC /r TZCNT r64, r/m64       RM      V/N.E.           BMI1                 Count the number of trailing zero bits in r/m64, return result in r64.


Instruction Operand Encoding 

Op/En   Operand 1       Operand 2       Operand 3   Operand 4
RM      ModRM:reg (w)   ModRM:r/m (r)   NA          NA


Description 

TZCNT counts the number of trailing least significant zero bits in the source operand (second operand) and returns the 
result in the destination operand (first operand). TZCNT is an extension of the BSF instruction. The key difference 
between the TZCNT and BSF instructions is that TZCNT provides the operand size as output when the source operand is zero, 
while in the case of BSF, if the source operand is zero, the content of the destination operand is undefined. On 
processors that do not support TZCNT, the instruction byte encoding is executed as BSF. 

Operation 

temp ← 0 
DEST ← 0 
DO WHILE ((temp < OperandSize) and (SRC[temp] = 0)) 
    temp ← temp + 1 
    DEST ← DEST + 1 
OD 

IF DEST = OperandSize 
    CF ← 1 
ELSE 
    CF ← 0 
FI 

IF DEST = 0 
    ZF ← 1 
ELSE 
    ZF ← 0 
FI 

Flags Affected 

ZF is set to 1 in case of zero output (least significant bit of the source is set), and to 0 otherwise. CF is set to 1 if 
the input was zero and cleared otherwise. OF, SF, PF and AF flags are undefined. 
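
The count and the two defined flags can be modeled in a few lines. This is an illustrative sketch: the function name `tzcnt` and the returned tuple layout are hypothetical choices, and the undefined flags (OF, SF, PF, AF) are not modeled.

```python
def tzcnt(src: int, operand_size: int = 32):
    """Return (DEST, CF, ZF) following the Operation pseudocode for TZCNT."""
    src &= (1 << operand_size) - 1
    dest = 0
    # Walk up from bit 0 until a set bit is found or the operand is exhausted.
    while dest < operand_size and ((src >> dest) & 1) == 0:
        dest += 1
    cf = int(dest == operand_size)   # the input was zero
    zf = int(dest == 0)              # the least significant bit was set
    return dest, cf, zf
```

Unlike BSF, a zero source yields a well-defined result: `tzcnt(0)` returns the operand size with CF set, so callers can test CF instead of special-casing a zero input beforehand.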

Intel C/C++ Compiler Intrinsic Equivalent 

TZCNT: unsigned __int32 _tzcnt_u32(unsigned __int32 src); 

TZCNT: unsigned __int64 _tzcnt_u64(unsigned __int64 src); 

Protected Mode Exceptions 

#GP(0) For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments. 

If the DS, ES, FS, or GS register is used to access memory and it contains a null segment 
selector. 

#SS(0) For an illegal address in the SS segment. 

#PF (fault-code) For a page fault. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3. 

Real-Address Mode Exceptions 

#GP(0) If any part of the operand lies outside of the effective address space from 0 to 0FFFFH. 

#SS(0) For an illegal address in the SS segment. 

Virtual 8086 Mode Exceptions 

#GP(0) If any part of the operand lies outside of the effective address space from 0 to 0FFFFH. 

#SS(0) For an illegal address in the SS segment. 

#PF (fault-code) For a page fault. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3. 

Compatibility Mode Exceptions 

Same exceptions as in Protected Mode. 


64-Bit Mode Exceptions 


#GP(0) If the memory address is in a non-canonical form. 

#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 

#PF (fault-code) For a page fault. 

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3. 

UCOMISD—Unordered Compare Scalar Double-Precision Floating-Point Values and Set EFLAGS 

Opcode/Instruction                                         Op/En   64/32 bit Mode Support   CPUID Feature Flag   Description
66 0F 2E /r UCOMISD xmm1, xmm2/m64                         RM      V/V                      SSE2                 Compare low double-precision floating-point values in xmm1 and xmm2/mem64 and set the EFLAGS flags accordingly.
VEX.128.66.0F.WIG 2E /r VUCOMISD xmm1, xmm2/m64            RM      V/V                      AVX                  Compare low double-precision floating-point values in xmm1 and xmm2/mem64 and set the EFLAGS flags accordingly.
EVEX.LIG.66.0F.W1 2E /r VUCOMISD xmm1, xmm2/m64{sae}       T1S     V/V                      AVX512F              Compare low double-precision floating-point values in xmm1 and xmm2/m64 and set the EFLAGS flags accordingly.


Instruction Operand Encoding 

Op/En   Operand 1       Operand 2       Operand 3   Operand 4
RM      ModRM:reg (r)   ModRM:r/m (r)   NA          NA
T1S     ModRM:reg (w)   ModRM:r/m (r)   NA          NA


Description 

Performs an unordered compare of the double-precision floating-point values in the low quadwords of operand 1 
(first operand) and operand 2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according 
to the result (unordered, greater than, less than, or equal). The OF, SF and AF flags in the EFLAGS register are set 
to 0. The unordered result is returned if either source operand is a NaN (QNaN or SNaN). 

Operand 1 is an XMM register; operand 2 can be an XMM register or a 64-bit memory location. 

The UCOMISD instruction differs from the COMISD instruction in that it signals a SIMD floating-point invalid 
operation exception (#I) only when a source operand is an SNaN. The COMISD instruction signals an invalid 
operation exception if a source operand is either an SNaN or a QNaN. 

The EFLAGS register is not updated if an unmasked SIMD floating-point exception is generated. 

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD. 

Software should ensure VUCOMISD is encoded with VEX.L=0. Encoding VUCOMISD with VEX.L=1 may encounter 
unpredictable behavior across different processor generations. 

Operation 

(V)UCOMISD (all versions) 

RESULT ← UnorderedCompare(DEST[63:0] <> SRC[63:0]) { 
(* Set EFLAGS *) CASE (RESULT) OF 
    UNORDERED:    ZF,PF,CF ← 111; 
    GREATER_THAN: ZF,PF,CF ← 000; 
    LESS_THAN:    ZF,PF,CF ← 001; 
    EQUAL:        ZF,PF,CF ← 100; 
ESAC; 
OF, AF, SF ← 0; } 
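
The CASE table above can be modeled in plain C. This is a minimal sketch rather than a description of the hardware: the type `cmp_flags` and the functions `ucomisd_model` and `ucomisd_model_example` are hypothetical names introduced here, and the model covers only the ZF, PF and CF results, not the signaling of the invalid operation exception for SNaN operands.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper type: the three flags written by the compare.
   OF, AF and SF are always cleared, so they are not modeled. */
typedef struct { int zf, pf, cf; } cmp_flags;

/* Scalar model of the UnorderedCompare CASE table. */
cmp_flags ucomisd_model(double a, double b)
{
    cmp_flags f = {0, 0, 0};            /* GREATER_THAN: ZF,PF,CF = 000 */
    if (isnan(a) || isnan(b)) {
        f.zf = 1; f.pf = 1; f.cf = 1;   /* UNORDERED:    ZF,PF,CF = 111 */
    } else if (a < b) {
        f.cf = 1;                       /* LESS_THAN:    ZF,PF,CF = 001 */
    } else if (a == b) {
        f.zf = 1;                       /* EQUAL:        ZF,PF,CF = 100 */
    }
    return f;
}

/* Usage: a NaN in either operand selects the unordered row. */
void ucomisd_model_example(void)
{
    cmp_flags u = ucomisd_model(1.0, NAN);
    assert(u.zf == 1 && u.pf == 1 && u.cf == 1);
}
```

Note that the unordered and equal rows both set ZF, so a test of ZF alone cannot distinguish them; PF is the flag that separates the two cases.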


Intel C/C++ Compiler Intrinsic Equivalent 

VUCOMISD int _mm_comi_round_sd(__m128d a, __m128d b, int imm, int sae); 
UCOMISD int _mm_ucomieq_sd(__m128d a, __m128d b) 
UCOMISD int _mm_ucomilt_sd(__m128d a, __m128d b) 
UCOMISD int _mm_ucomile_sd(__m128d a, __m128d b) 
UCOMISD int _mm_ucomigt_sd(__m128d a, __m128d b) 
UCOMISD int _mm_ucomige_sd(__m128d a, __m128d b) 
UCOMISD int _mm_ucomineq_sd(__m128d a, __m128d b) 

SIMD Floating-Point Exceptions 

Invalid (if SNaN operands), Denormal 

Other Exceptions 

VEX-encoded instructions, see Exceptions Type 3; additionally 
#UD If VEX.vvvv != 1111B. 

EVEX-encoded instructions, see Exceptions Type E3NF. 
UCOMISS—Unordered Compare Scalar Single-Precision Floating-Point Values and Set EFLAGS 

Opcode/Instruction                                 Op/En  64/32 bit Mode Support  CPUID Feature Flag  Description
0F 2E /r UCOMISS xmm1, xmm2/m32                    RM     V/V                     SSE                 Compare low single-precision floating-point values in xmm1 and xmm2/mem32 and set the EFLAGS flags accordingly.
VEX.128.0F.WIG 2E /r VUCOMISS xmm1, xmm2/m32       RM     V/V                     AVX                 Compare low single-precision floating-point values in xmm1 and xmm2/mem32 and set the EFLAGS flags accordingly.
EVEX.LIG.0F.W0 2E /r VUCOMISS xmm1, xmm2/m32{sae}  T1S    V/V                     AVX512F             Compare low single-precision floating-point values in xmm1 and xmm2/mem32 and set the EFLAGS flags accordingly.


Instruction Operand Encoding 


Op/En  Operand 1         Operand 2      Operand 3      Operand 4
RM     ModRM:reg (r)     ModRM:r/m (r)  NA             NA
T1S    ModRM:reg (w)     ModRM:r/m (r)  NA             NA


Description 

Compares the single-precision floating-point values in the low doublewords of operand 1 (first operand) and 
operand 2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result 
(unordered, greater than, less than, or equal). The OF, SF and AF flags in the EFLAGS register are set to 0. The 
unordered result is returned if either source operand is a NaN (QNaN or SNaN). 

Operand 1 is an XMM register; operand 2 can be an XMM register or a 32-bit memory location. 

The UCOMISS instruction differs from the COMISS instruction in that it signals a SIMD floating-point invalid 
operation exception (#I) only if a source operand is an SNaN. The COMISS instruction signals an invalid 
operation exception when a source operand is either a QNaN or SNaN. 

The EFLAGS register is not updated if an unmasked SIMD floating-point exception is generated. 

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD. 

Software should ensure VUCOMISS is encoded with VEX.L=0. Encoding VUCOMISS with VEX.L=1 may encounter 
unpredictable behavior across different processor generations. 

Operation 

(V)UCOMISS (all versions) 

RESULT ← UnorderedCompare(DEST[31:0] <> SRC[31:0]) { 
(* Set EFLAGS *) CASE (RESULT) OF 
    UNORDERED:    ZF,PF,CF ← 111; 
    GREATER_THAN: ZF,PF,CF ← 000; 
    LESS_THAN:    ZF,PF,CF ← 001; 
    EQUAL:        ZF,PF,CF ← 100; 
ESAC; 
OF, AF, SF ← 0; } 
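
Software normally consumes these flags through the unsigned-style condition codes (above, below, equal, parity). The sketch below shows how each condition reads the ZF, PF and CF triple written by the compare. The type `ucomiss_flags` and all function names are hypothetical and introduced only for illustration; the flag expressions follow the standard definitions of the A, AE, B, BE, E and P conditions.

```c
#include <assert.h>

/* Hypothetical helper type: the flag triple written by (V)UCOMISS. */
typedef struct { int zf, pf, cf; } ucomiss_flags;

/* Each predicate mirrors the flag test of the named condition code. */
int is_above(ucomiss_flags f)          { return !f.cf && !f.zf; } /* A:  greater than           */
int is_above_or_equal(ucomiss_flags f) { return !f.cf; }          /* AE: greater than or equal  */
int is_below(ucomiss_flags f)          { return f.cf; }           /* B:  less than OR unordered */
int is_below_or_equal(ucomiss_flags f) { return f.cf || f.zf; }   /* BE: also true if unordered */
int is_equal(ucomiss_flags f)          { return f.zf; }           /* E:  equal OR unordered     */
int is_unordered(ucomiss_flags f)      { return f.pf; }           /* P:  unordered only         */

/* An equality test that rejects NaN operands needs ZF = 1 and PF = 0. */
int is_ordered_equal(ucomiss_flags f)  { return f.zf && !f.pf; }

/* Usage: the unordered result satisfies E but not the ordered test. */
void ucomiss_conditions_example(void)
{
    ucomiss_flags unordered = {1, 1, 1};
    ucomiss_flags equal     = {1, 0, 0};
    assert(is_equal(unordered) && !is_ordered_equal(unordered));
    assert(is_equal(equal) && is_ordered_equal(equal));
}
```

The above and above-or-equal conditions are false for the unordered result, which is why code that must treat NaN as "not greater" can test them directly, while below and equal tests need an additional parity check.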
Intel C/C++ Compiler Intrinsic Equivalent 

VUCOMISS int _mm_comi_round_ss(__m128 a, __m128 b, int imm, int sae); 
UCOMISS int _mm_ucomieq_ss(__m128 a, __m128 b); 
UCOMISS int _mm_ucomilt_ss(__m128 a, __m128 b); 
UCOMISS int _mm_ucomile_ss(__m128 a, __m128 b); 
UCOMISS int _mm_ucomigt_ss(__m128 a, __m128 b); 
UCOMISS int _mm_ucomige_ss(__m128 a, __m128 b); 
UCOMISS int _mm_ucomineq_ss(__m128 a, __m128 b); 

SIMD Floating-Point Exceptions 

Invalid (if SNaN Operands), Denormal 

Other Exceptions 

VEX-encoded instructions, see Exceptions Type 3; additionally 
#UD If VEX.vvvv != 1111B. 

EVEX-encoded instructions, see Exceptions Type E3NF. 
UD2—Undefined Instruction 

Opcode  Instruction  Op/En  64-Bit Mode  Compat/Leg Mode  Description
0F 0B   UD2          NP     Valid        Valid            Raise invalid opcode exception.


Instruction Operand Encoding 


Op/En  Operand 1  Operand 2  Operand 3  Operand 4
NP     NA         NA         NA         NA


Description 

Generates an invalid opcode exception. This instruction is provided for software testing to explicitly generate an 
invalid opcode exception. The opcode for this instruction is reserved for this purpose. 

Other than raising the invalid opcode exception, this instruction has no effect on processor state or memory. 

Even though it is the execution of the UD2 instruction that causes the invalid opcode exception, the instruction 
pointer saved by delivery of the exception references the UD2 instruction (and not the following instruction). 

This instruction's operation is the same in non-64-bit modes and 64-bit mode. 

Operation 

#UD (* Generates invalid opcode exception *); 

Flags Affected 

None. 

Exceptions (All Operating Modes) 

#UD Raises an invalid opcode exception in all operating modes. 


UNPCKHPD—Unpack and Interleave High Packed Double-Precision Floating-Point Values 

Opcode/Instruction                                                            Op/En  64/32 bit Mode Support  CPUID Feature Flag  Description
66 0F 15 /r UNPCKHPD xmm1, xmm2/m128                                          RM     V/V                     SSE2                Unpacks and Interleaves double-precision floating-point values from high quadwords of xmm1 and xmm2/m128.
VEX.NDS.128.66.0F.WIG 15 /r VUNPCKHPD xmm1, xmm2, xmm3/m128                   RVM    V/V                     AVX                 Unpacks and Interleaves double-precision floating-point values from high quadwords of xmm2 and xmm3/m128.
VEX.NDS.256.66.0F.WIG 15 /r VUNPCKHPD ymm1, ymm2, ymm3/m256                   RVM    V/V                     AVX                 Unpacks and Interleaves double-precision floating-point values from high quadwords of ymm2 and ymm3/m256.
EVEX.NDS.128.66.0F.W1 15 /r VUNPCKHPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst  FV     V/V                     AVX512VL AVX512F    Unpacks and Interleaves double-precision floating-point values from high quadwords of xmm2 and xmm3/m128/m64bcst subject to writemask k1.
EVEX.NDS.256.66.0F.W1 15 /r VUNPCKHPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst  FV     V/V                     AVX512VL AVX512F    Unpacks and Interleaves double-precision floating-point values from high quadwords of ymm2 and ymm3/m256/m64bcst subject to writemask k1.
EVEX.NDS.512.66.0F.W1 15 /r VUNPCKHPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst  FV     V/V                     AVX512F             Unpacks and Interleaves double-precision floating-point values from high quadwords of zmm2 and zmm3/m512/m64bcst subject to writemask k1.


Instruction Operand Encoding 


Op/En  Operand 1         Operand 2      Operand 3      Operand 4
RM     ModRM:reg (r, w)  ModRM:r/m (r)  NA             NA
RVM    ModRM:reg (w)     VEX.vvvv (r)   ModRM:r/m (r)  NA
FV     ModRM:reg (w)     EVEX.vvvv (r)  ModRM:r/m (r)  NA


Description 

Performs an interleaved unpack of the high double-precision floating-point values from the first source operand and 
the second source operand. See Figure 4-15 in the Intel® 64 and IA-32 Architectures Software Developer's 
Manual, Volume 2B. 

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The 
destination is not distinct from the first source XMM register and the upper bits (MAX_VL-1:128) of the 
corresponding ZMM register destination are unmodified. When unpacking from a memory operand, an implementation 
may fetch only the appropriate 64 bits; however, alignment to a 16-byte boundary and normal segment checking 
will still be enforced. 

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM 
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128) 
of the corresponding ZMM register destination are zeroed. 

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM 
register or a 256-bit memory location. The destination operand is a YMM register. 

EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM 
register, a 512-bit memory location, or a 512-bit vector broadcasted from a 64-bit memory location. The 
destination operand is a ZMM register, conditionally updated using writemask k1. 

EVEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM 
register, a 256-bit memory location, or a 256-bit vector broadcasted from a 64-bit memory location. The 
destination operand is a YMM register, conditionally updated using writemask k1. 

EVEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM 
register, a 128-bit memory location, or a 128-bit vector broadcasted from a 64-bit memory location. The 
destination operand is an XMM register, conditionally updated using writemask k1. 
Operation 

VUNPCKHPD (EVEX encoded versions when SRC2 is a register) 

(KL, VL) = (2, 128), (4, 256), (8, 512) 
IF VL >= 128 
    TMP_DEST[63:0] ← SRC1[127:64] 
    TMP_DEST[127:64] ← SRC2[127:64] 
FI; 
IF VL >= 256 
    TMP_DEST[191:128] ← SRC1[255:192] 
    TMP_DEST[255:192] ← SRC2[255:192] 
FI; 
IF VL >= 512 
    TMP_DEST[319:256] ← SRC1[383:320] 
    TMP_DEST[383:320] ← SRC2[383:320] 
    TMP_DEST[447:384] ← SRC1[511:448] 
    TMP_DEST[511:448] ← SRC2[511:448] 
FI; 
FOR j ← 0 TO KL-1 
    i ← j * 64 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i] 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+63:i] remains unchanged* 
                ELSE *zeroing-masking* ; zeroing-masking 
                    DEST[i+63:i] ← 0 
            FI 
    FI; 
ENDFOR 
DEST[MAX_VL-1:VL] ← 0 
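
The write-back loop at the end of the EVEX operation can be sketched in C as follows. This is an illustrative model only: the function name `writeback_masked64` and its parameters are hypothetical, vectors are represented as arrays of 64-bit elements, and the final zeroing of bits above VL is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the FOR j ... ENDFOR write-back for 64-bit elements.
   dest and tmp each hold kl elements; bit j of k1 is the writemask bit
   for element j; zeroing != 0 selects zeroing-masking, otherwise
   merging-masking is used. For the "no writemask" case pass k1 = 0xFF. */
void writeback_masked64(uint64_t *dest, const uint64_t *tmp,
                        unsigned kl, uint8_t k1, int zeroing)
{
    assert(kl == 2 || kl == 4 || kl == 8);
    for (unsigned j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u) {
            dest[j] = tmp[j];   /* mask bit set: take the new element */
        } else if (zeroing) {
            dest[j] = 0;        /* zeroing-masking */
        }                       /* merging-masking: dest[j] unchanged */
    }
}
```

For example, with `kl = 2`, `k1 = 0x1` and merging-masking, element 0 of the destination receives the new value while element 1 keeps its previous contents.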
VUNPCKHPD (EVEX encoded version when SRC2 is memory) 

(KL, VL) = (2, 128), (4, 256), (8, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 64 
    IF (EVEX.b = 1) 
        THEN TMP_SRC2[i+63:i] ← SRC2[63:0] 
        ELSE TMP_SRC2[i+63:i] ← SRC2[i+63:i] 
    FI; 
ENDFOR; 
IF VL >= 128 
    TMP_DEST[63:0] ← SRC1[127:64] 
    TMP_DEST[127:64] ← TMP_SRC2[127:64] 
FI; 
IF VL >= 256 
    TMP_DEST[191:128] ← SRC1[255:192] 
    TMP_DEST[255:192] ← TMP_SRC2[255:192] 
FI; 
IF VL >= 512 
    TMP_DEST[319:256] ← SRC1[383:320] 
    TMP_DEST[383:320] ← TMP_SRC2[383:320] 
    TMP_DEST[447:384] ← SRC1[511:448] 
    TMP_DEST[511:448] ← TMP_SRC2[511:448] 
FI; 
FOR j ← 0 TO KL-1 
    i ← j * 64 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i] 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+63:i] remains unchanged* 
                ELSE *zeroing-masking* ; zeroing-masking 
                    DEST[i+63:i] ← 0 
            FI 
    FI; 
ENDFOR 
DEST[MAX_VL-1:VL] ← 0 
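
The first loop of the memory form builds TMP_SRC2, either by copying the memory operand element by element or by replicating its first element when EVEX.b is set. A minimal C sketch of that step follows; the function name `load_src2_64` and its parameters are hypothetical, and memory is modeled as an array of 64-bit elements.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the TMP_SRC2 construction for 64-bit elements.
   If evex_b is nonzero, only mem[0] is read and it is broadcast to all
   kl elements; otherwise mem must provide kl elements, which are copied. */
void load_src2_64(uint64_t *tmp_src2, const uint64_t *mem,
                  unsigned kl, int evex_b)
{
    assert(kl == 2 || kl == 4 || kl == 8);
    for (unsigned j = 0; j < kl; j++) {
        tmp_src2[j] = evex_b ? mem[0] : mem[j];
    }
}
```

Because the interleave then selects only the high quadword of each 128-bit lane of TMP_SRC2, the broadcast form places the same memory value in every odd element of the result.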


VUNPCKHPD (VEX.256 encoded version) 

DEST[63:0] ← SRC1[127:64] 
DEST[127:64] ← SRC2[127:64] 
DEST[191:128] ← SRC1[255:192] 
DEST[255:192] ← SRC2[255:192] 
DEST[MAX_VL-1:256] ← 0 

VUNPCKHPD (VEX.128 encoded version) 

DEST[63:0] ← SRC1[127:64] 
DEST[127:64] ← SRC2[127:64] 
DEST[MAX_VL-1:128] ← 0 

UNPCKHPD (128-bit Legacy SSE version) 

DEST[63:0] ← SRC1[127:64] 
DEST[127:64] ← SRC2[127:64] 
DEST[MAX_VL-1:128] (Unmodified) 
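
All of the forms above apply the same rule to each 128-bit lane: the result lane is the high quadword of the first source followed by the high quadword of the second source. The following C sketch models that rule for any of the three vector lengths. The function name `unpackhi_pd_model` is hypothetical, vectors are modeled as arrays of `double`, and masking and upper-bit handling are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Model of the high-quadword interleave for vl = 128, 256 or 512 bits.
   Each 128-bit lane holds two doubles; element 2*lane + 1 is the high one. */
void unpackhi_pd_model(double *dest, const double *src1,
                       const double *src2, unsigned vl)
{
    assert(vl == 128 || vl == 256 || vl == 512);
    size_t lanes = vl / 128;
    for (size_t lane = 0; lane < lanes; lane++) {
        /* Read both inputs before writing, so dest may alias src1
           (as in the legacy SSE form) or src2. */
        double hi1 = src1[2 * lane + 1];
        double hi2 = src2[2 * lane + 1];
        dest[2 * lane]     = hi1;
        dest[2 * lane + 1] = hi2;
    }
}
```

With `src1 = {0, 1, 2, 3}`, `src2 = {10, 11, 12, 13}` and `vl = 256`, the model produces `{1, 11, 3, 13}`: the interleave never moves data across a 128-bit lane boundary.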
Intel C/C++ Compiler Intrinsic Equivalent 

VUNPCKHPD __m512d _mm512_unpackhi_pd(__m512d a, __m512d b); 
VUNPCKHPD __m512d _mm512_mask_unpackhi_pd(__m512d s, __mmask8 k, __m512d a, __m512d b); 
VUNPCKHPD __m512d _mm512_maskz_unpackhi_pd(__mmask8 k, __m512d a, __m512d b); 
VUNPCKHPD __m256d _mm256_unpackhi_pd(__m256d a, __m256d b) 
VUNPCKHPD __m256d _mm256_mask_unpackhi_pd(__m256d s, __mmask8 k, __m256d a, __m256d b); 
VUNPCKHPD __m256d _mm256_maskz_unpackhi_pd(__mmask8 k, __m256d a, __m256d b); 
UNPCKHPD __m128d _mm_unpackhi_pd(__m128d a, __m128d b) 
VUNPCKHPD __m128d _mm_mask_unpackhi_pd(__m128d s, __mmask8 k, __m128d a, __m128d b); 
VUNPCKHPD __m128d _mm_maskz_unpackhi_pd(__mmask8 k, __m128d a, __m128d b); 

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instructions, see Exceptions Type 4. 

EVEX-encoded instructions, see Exceptions Type E4NF. 
UNPCKHPS—Unpack and Interleave High Packed Single-Precision Floating-Point Values 

Opcode/Instruction                                                         Op/En  64/32 bit Mode Support  CPUID Feature Flag  Description
0F 15 /r UNPCKHPS xmm1, xmm2/m128                                          RM     V/V                     SSE                 Unpacks and Interleaves single-precision floating-point values from high quadwords of xmm1 and xmm2/m128.
VEX.NDS.128.0F.WIG 15 /r VUNPCKHPS xmm1, xmm2, xmm3/m128                   RVM    V/V                     AVX                 Unpacks and Interleaves single-precision floating-point values from high quadwords of xmm2 and xmm3/m128.
VEX.NDS.256.0F.WIG 15 /r VUNPCKHPS ymm1, ymm2, ymm3/m256                   RVM    V/V                     AVX                 Unpacks and Interleaves single-precision floating-point values from high quadwords of ymm2 and ymm3/m256.
EVEX.NDS.128.0F.W0 15 /r VUNPCKHPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst  FV     V/V                     AVX512VL AVX512F    Unpacks and Interleaves single-precision floating-point values from high quadwords of xmm2 and xmm3/m128/m32bcst and write result to xmm1 subject to writemask k1.
EVEX.NDS.256.0F.W0 15 /r VUNPCKHPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst  FV     V/V                     AVX512VL AVX512F    Unpacks and Interleaves single-precision floating-point values from high quadwords of ymm2 and ymm3/m256/m32bcst and write result to ymm1 subject to writemask k1.
EVEX.NDS.512.0F.W0 15 /r VUNPCKHPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst  FV     V/V                     AVX512F             Unpacks and Interleaves single-precision floating-point values from high quadwords of zmm2 and zmm3/m512/m32bcst and write result to zmm1 subject to writemask k1.


Instruction Operand Encoding 


Op/En  Operand 1         Operand 2      Operand 3      Operand 4
RM     ModRM:reg (r, w)  ModRM:r/m (r)  NA             NA
RVM    ModRM:reg (w)     VEX.vvvv (r)   ModRM:r/m (r)  NA
FV     ModRM:reg (w)     EVEX.vvvv (r)  ModRM:r/m (r)  NA


Description 

Performs an interleaved unpack of the high single-precision floating-point values from the first source operand and 
the second source operand. 

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The 
destination is not distinct from the first source XMM register and the upper bits (MAX_VL-1:128) of the 
corresponding ZMM register destination are unmodified. When unpacking from a memory operand, an implementation 
may fetch only the appropriate 64 bits; however, alignment to a 16-byte boundary and normal segment checking 
will still be enforced. 

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM 
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128) 
of the corresponding ZMM register destination are zeroed. 

VEX.256 encoded version: The second source operand is a YMM register or a 256-bit memory location. The first 
source operand and the destination operand are YMM registers. 

Figure 4-27. VUNPCKHPS Operation 

EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM 
register, a 512-bit memory location, or a 512-bit vector broadcasted from a 32-bit memory location. The 
destination operand is a ZMM register, conditionally updated using writemask k1. 

EVEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM 
register, a 256-bit memory location, or a 256-bit vector broadcasted from a 32-bit memory location. The 
destination operand is a YMM register, conditionally updated using writemask k1. 

EVEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM 
register, a 128-bit memory location, or a 128-bit vector broadcasted from a 32-bit memory location. The 
destination operand is an XMM register, conditionally updated using writemask k1. 

Operation 

VUNPCKHPS (EVEX encoded version when SRC2 is a register) 

(KL, VL) = (4, 128), (8, 256), (16, 512) 
IF VL >= 128 
    TMP_DEST[31:0] ← SRC1[95:64] 
    TMP_DEST[63:32] ← SRC2[95:64] 
    TMP_DEST[95:64] ← SRC1[127:96] 
    TMP_DEST[127:96] ← SRC2[127:96] 
FI; 
IF VL >= 256 
    TMP_DEST[159:128] ← SRC1[223:192] 
    TMP_DEST[191:160] ← SRC2[223:192] 
    TMP_DEST[223:192] ← SRC1[255:224] 
    TMP_DEST[255:224] ← SRC2[255:224] 
FI; 
IF VL >= 512 
    TMP_DEST[287:256] ← SRC1[351:320] 
    TMP_DEST[319:288] ← SRC2[351:320] 
    TMP_DEST[351:320] ← SRC1[383:352] 
    TMP_DEST[383:352] ← SRC2[383:352] 
    TMP_DEST[415:384] ← SRC1[479:448] 
    TMP_DEST[447:416] ← SRC2[479:448] 
    TMP_DEST[479:448] ← SRC1[511:480] 
    TMP_DEST[511:480] ← SRC2[511:480] 
FI; 
FOR j ← 0 TO KL-1 
    i ← j * 32 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i] 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+31:i] remains unchanged* 
                ELSE *zeroing-masking* ; zeroing-masking 
                    DEST[i+31:i] ← 0 
            FI 
    FI; 
ENDFOR 
DEST[MAX_VL-1:VL] ← 0 


VUNPCKHPS (EVEX encoded version when SRC2 is memory) 

(KL, VL) = (4, 128), (8, 256), (16, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 32 
    IF (EVEX.b = 1) 
        THEN TMP_SRC2[i+31:i] ← SRC2[31:0] 
        ELSE TMP_SRC2[i+31:i] ← SRC2[i+31:i] 
    FI; 
ENDFOR; 
IF VL >= 128 
    TMP_DEST[31:0] ← SRC1[95:64] 
    TMP_DEST[63:32] ← TMP_SRC2[95:64] 
    TMP_DEST[95:64] ← SRC1[127:96] 
    TMP_DEST[127:96] ← TMP_SRC2[127:96] 
FI; 
IF VL >= 256 
    TMP_DEST[159:128] ← SRC1[223:192] 
    TMP_DEST[191:160] ← TMP_SRC2[223:192] 
    TMP_DEST[223:192] ← SRC1[255:224] 
    TMP_DEST[255:224] ← TMP_SRC2[255:224] 
FI; 
IF VL >= 512 
    TMP_DEST[287:256] ← SRC1[351:320] 
    TMP_DEST[319:288] ← TMP_SRC2[351:320] 
    TMP_DEST[351:320] ← SRC1[383:352] 
    TMP_DEST[383:352] ← TMP_SRC2[383:352] 
    TMP_DEST[415:384] ← SRC1[479:448] 
    TMP_DEST[447:416] ← TMP_SRC2[479:448] 
    TMP_DEST[479:448] ← SRC1[511:480] 
    TMP_DEST[511:480] ← TMP_SRC2[511:480] 
FI; 
FOR j ← 0 TO KL-1 
    i ← j * 32 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i] 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+31:i] remains unchanged* 
                ELSE *zeroing-masking* ; zeroing-masking 
                    DEST[i+31:i] ← 0 
            FI 
    FI; 
ENDFOR 
DEST[MAX_VL-1:VL] ← 0 

VUNPCKHPS (VEX.256 encoded version) 

DEST[31:0] ← SRC1[95:64] 
DEST[63:32] ← SRC2[95:64] 
DEST[95:64] ← SRC1[127:96] 
DEST[127:96] ← SRC2[127:96] 
DEST[159:128] ← SRC1[223:192] 
DEST[191:160] ← SRC2[223:192] 
DEST[223:192] ← SRC1[255:224] 
DEST[255:224] ← SRC2[255:224] 
DEST[MAX_VL-1:256] ← 0 

VUNPCKHPS (VEX.128 encoded version) 

DEST[31:0] ← SRC1[95:64] 
DEST[63:32] ← SRC2[95:64] 
DEST[95:64] ← SRC1[127:96] 
DEST[127:96] ← SRC2[127:96] 
DEST[MAX_VL-1:128] ← 0 

UNPCKHPS (128-bit Legacy SSE version) 

DEST[31:0] ← SRC1[95:64] 
DEST[63:32] ← SRC2[95:64] 
DEST[95:64] ← SRC1[127:96] 
DEST[127:96] ← SRC2[127:96] 
DEST[MAX_VL-1:128] (Unmodified) 
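
The bit-range assignments above reduce to a simple index rule: within each 128-bit lane, even result slots come from the first source, odd slots from the second source, and both read from the upper two elements of the lane. The C sketch below expresses that rule. The function name `unpackhi_ps_model` is hypothetical, vectors are modeled as arrays of `float`, and masking and upper-bit handling are not included.

```c
#include <assert.h>
#include <string.h>

/* Model of the high single-precision interleave for vl = 128, 256 or 512.
   Result element e (lane = e / 4, pos = e % 4) is taken from
   src1 when pos is even and src2 when pos is odd, at lane element 2 + pos / 2. */
void unpackhi_ps_model(float *dest, const float *src1,
                       const float *src2, unsigned vl)
{
    assert(vl == 128 || vl == 256 || vl == 512);
    float tmp[16];                  /* large enough for 512 bits of floats */
    unsigned n = vl / 32;           /* number of 32-bit elements */
    for (unsigned e = 0; e < n; e++) {
        unsigned lane = e / 4, pos = e % 4;
        const float *src = (pos & 1u) ? src2 : src1;
        tmp[e] = src[4 * lane + 2 + pos / 2];
    }
    /* Results are staged in tmp so that dest may alias either source. */
    memcpy(dest, tmp, n * sizeof(float));
}
```

For `src1 = {0, 1, 2, 3, 4, 5, 6, 7}`, `src2 = {10, 11, 12, 13, 14, 15, 16, 17}` and `vl = 256`, the model yields `{2, 12, 3, 13, 6, 16, 7, 17}`.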

Intel C/C++ Compiler Intrinsic Equivalent 

VUNPCKHPS __m512 _mm512_unpackhi_ps(__m512 a, __m512 b); 
VUNPCKHPS __m512 _mm512_mask_unpackhi_ps(__m512 s, __mmask16 k, __m512 a, __m512 b); 
VUNPCKHPS __m512 _mm512_maskz_unpackhi_ps(__mmask16 k, __m512 a, __m512 b); 
VUNPCKHPS __m256 _mm256_unpackhi_ps(__m256 a, __m256 b); 
VUNPCKHPS __m256 _mm256_mask_unpackhi_ps(__m256 s, __mmask8 k, __m256 a, __m256 b); 
VUNPCKHPS __m256 _mm256_maskz_unpackhi_ps(__mmask8 k, __m256 a, __m256 b); 
UNPCKHPS __m128 _mm_unpackhi_ps(__m128 a, __m128 b); 
VUNPCKHPS __m128 _mm_mask_unpackhi_ps(__m128 s, __mmask8 k, __m128 a, __m128 b); 
VUNPCKHPS __m128 _mm_maskz_unpackhi_ps(__mmask8 k, __m128 a, __m128 b); 

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instructions, see Exceptions Type 4. 

EVEX-encoded instructions, see Exceptions Type E4NF. 
UNPCKLPD—Unpack and Interleave Low Packed Double-Precision Floating-Point Values 

Opcode/Instruction                                                            Op/En  64/32 bit Mode Support  CPUID Feature Flag  Description
66 0F 14 /r UNPCKLPD xmm1, xmm2/m128                                          RM     V/V                     SSE2                Unpacks and Interleaves double-precision floating-point values from low quadwords of xmm1 and xmm2/m128.
VEX.NDS.128.66.0F.WIG 14 /r VUNPCKLPD xmm1, xmm2, xmm3/m128                   RVM    V/V                     AVX                 Unpacks and Interleaves double-precision floating-point values from low quadwords of xmm2 and xmm3/m128.
VEX.NDS.256.66.0F.WIG 14 /r VUNPCKLPD ymm1, ymm2, ymm3/m256                   RVM    V/V                     AVX                 Unpacks and Interleaves double-precision floating-point values from low quadwords of ymm2 and ymm3/m256.
EVEX.NDS.128.66.0F.W1 14 /r VUNPCKLPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst  FV     V/V                     AVX512VL AVX512F    Unpacks and Interleaves double-precision floating-point values from low quadwords of xmm2 and xmm3/m128/m64bcst subject to write mask k1.
EVEX.NDS.256.66.0F.W1 14 /r VUNPCKLPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst  FV     V/V                     AVX512VL AVX512F    Unpacks and Interleaves double-precision floating-point values from low quadwords of ymm2 and ymm3/m256/m64bcst subject to write mask k1.
EVEX.NDS.512.66.0F.W1 14 /r VUNPCKLPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst  FV     V/V                     AVX512F             Unpacks and Interleaves double-precision floating-point values from low quadwords of zmm2 and zmm3/m512/m64bcst subject to write mask k1.


Instruction Operand Encoding 


Op/En  Operand 1         Operand 2      Operand 3      Operand 4
RM     ModRM:reg (r, w)  ModRM:r/m (r)  NA             NA
RVM    ModRM:reg (w)     VEX.vvvv (r)   ModRM:r/m (r)  NA
FV     ModRM:reg (w)     EVEX.vvvv (r)  ModRM:r/m (r)  NA


Description 

Performs an interleaved unpack of the low double-precision floating-point values from the first source operand and 
the second source operand. 

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The 
destination is not distinct from the first source XMM register and the upper bits (MAX_VL-1:128) of the 
corresponding ZMM register destination are unmodified. When unpacking from a memory operand, an implementation 
may fetch only the appropriate 64 bits; however, alignment to a 16-byte boundary and normal segment checking 
will still be enforced. 

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM 
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128) 
of the corresponding ZMM register destination are zeroed. 

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM 
register or a 256-bit memory location. The destination operand is a YMM register. 

EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM 
register, a 512-bit memory location, or a 512-bit vector broadcasted from a 64-bit memory location. The 
destination operand is a ZMM register, conditionally updated using writemask k1. 

EVEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM 
register, a 256-bit memory location, or a 256-bit vector broadcasted from a 64-bit memory location. The 
destination operand is a YMM register, conditionally updated using writemask k1. 

EVEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM 
register, a 128-bit memory location, or a 128-bit vector broadcasted from a 64-bit memory location. The 
destination operand is an XMM register, conditionally updated using writemask k1. 
Operation 

VUNPCKLPD (EVEX encoded versions when SRC2 is a register) 

(KL, VL) = (2, 128), (4, 256), (8, 512) 
IF VL >= 128 
    TMP_DEST[63:0] ← SRC1[63:0] 
    TMP_DEST[127:64] ← SRC2[63:0] 
FI; 
IF VL >= 256 
    TMP_DEST[191:128] ← SRC1[191:128] 
    TMP_DEST[255:192] ← SRC2[191:128] 
FI; 
IF VL >= 512 
    TMP_DEST[319:256] ← SRC1[319:256] 
    TMP_DEST[383:320] ← SRC2[319:256] 
    TMP_DEST[447:384] ← SRC1[447:384] 
    TMP_DEST[511:448] ← SRC2[447:384] 
FI; 
FOR j ← 0 TO KL-1 
    i ← j * 64 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i] 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+63:i] remains unchanged* 
                ELSE *zeroing-masking* ; zeroing-masking 
                    DEST[i+63:i] ← 0 
            FI 
    FI; 
ENDFOR 
DEST[MAX_VL-1:VL] ← 0 
VUNPCKLPD (EVEX encoded version when SRC2 is memory) 

(KL, VL) = (2, 128), (4, 256), (8, 512) 
FOR j ← 0 TO KL-1 
    i ← j * 64 
    IF (EVEX.b = 1) 
        THEN TMP_SRC2[i+63:i] ← SRC2[63:0] 
        ELSE TMP_SRC2[i+63:i] ← SRC2[i+63:i] 
    FI; 
ENDFOR; 
IF VL >= 128 
    TMP_DEST[63:0] ← SRC1[63:0] 
    TMP_DEST[127:64] ← TMP_SRC2[63:0] 
FI; 
IF VL >= 256 
    TMP_DEST[191:128] ← SRC1[191:128] 
    TMP_DEST[255:192] ← TMP_SRC2[191:128] 
FI; 
IF VL >= 512 
    TMP_DEST[319:256] ← SRC1[319:256] 
    TMP_DEST[383:320] ← TMP_SRC2[319:256] 
    TMP_DEST[447:384] ← SRC1[447:384] 
    TMP_DEST[511:448] ← TMP_SRC2[447:384] 
FI; 
FOR j ← 0 TO KL-1 
    i ← j * 64 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i] 
        ELSE 
            IF *merging-masking* ; merging-masking 
                THEN *DEST[i+63:i] remains unchanged* 
                ELSE *zeroing-masking* ; zeroing-masking 
                    DEST[i+63:i] ← 0 
            FI 
    FI; 
ENDFOR 
DEST[MAX_VL-1:VL] ← 0 


VUNPCKLPD (VEX.256 encoded version) 

DEST[63:0] ← SRC1[63:0] 
DEST[127:64] ← SRC2[63:0] 
DEST[191:128] ← SRC1[191:128] 
DEST[255:192] ← SRC2[191:128] 
DEST[MAX_VL-1:256] ← 0 

VUNPCKLPD (VEX.128 encoded version) 

DEST[63:0] ← SRC1[63:0] 
DEST[127:64] ← SRC2[63:0] 
DEST[MAX_VL-1:128] ← 0 

UNPCKLPD (128-bit Legacy SSE version) 

DEST[63:0] ← SRC1[63:0] 
DEST[127:64] ← SRC2[63:0] 
DEST[MAX_VL-1:128] (Unmodified) 
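
One common use of the low and high double-precision unpacks together is transposing a 2x2 block: the low unpack gathers the first column and the high unpack gathers the second. The sketch below illustrates this with scalar C models of the 128-bit forms. All function names (`unpacklo_pd128`, `unpackhi_pd128`, `transpose2x2`) are hypothetical, and vectors are modeled as two-element `double` arrays rather than registers.

```c
#include <assert.h>

/* 128-bit model of the low unpack: dest = { a[0], b[0] }. */
void unpacklo_pd128(double dest[2], const double a[2], const double b[2])
{
    double lo_a = a[0], lo_b = b[0];   /* read first so dest may alias a or b */
    dest[0] = lo_a;
    dest[1] = lo_b;
}

/* 128-bit model of the high unpack: dest = { a[1], b[1] }. */
void unpackhi_pd128(double dest[2], const double a[2], const double b[2])
{
    double hi_a = a[1], hi_b = b[1];
    dest[0] = hi_a;
    dest[1] = hi_b;
}

/* Transpose rows r0, r1 of a 2x2 matrix into columns c0, c1.
   The outputs must not overlap the inputs, because both unpacks
   read the original rows. */
void transpose2x2(double c0[2], double c1[2],
                  const double r0[2], const double r1[2])
{
    assert(c0 != r0 && c0 != r1 && c1 != r0 && c1 != r1);
    unpacklo_pd128(c0, r0, r1);   /* c0 = { r0[0], r1[0] } */
    unpackhi_pd128(c1, r0, r1);   /* c1 = { r0[1], r1[1] } */
}
```

With rows `{1, 2}` and `{3, 4}`, the model returns columns `{1, 3}` and `{2, 4}`.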
Intel C/C++ Compiler Intrinsic Equivalent 

VUNPCKLPD __m512d _mm512_unpacklo_pd(__m512d a, __m512d b); 
VUNPCKLPD __m512d _mm512_mask_unpacklo_pd(__m512d s, __mmask8 k, __m512d a, __m512d b); 
VUNPCKLPD __m512d _mm512_maskz_unpacklo_pd(__mmask8 k, __m512d a, __m512d b); 
VUNPCKLPD __m256d _mm256_unpacklo_pd(__m256d a, __m256d b) 
VUNPCKLPD __m256d _mm256_mask_unpacklo_pd(__m256d s, __mmask8 k, __m256d a, __m256d b); 
VUNPCKLPD __m256d _mm256_maskz_unpacklo_pd(__mmask8 k, __m256d a, __m256d b); 
UNPCKLPD __m128d _mm_unpacklo_pd(__m128d a, __m128d b) 
VUNPCKLPD __m128d _mm_mask_unpacklo_pd(__m128d s, __mmask8 k, __m128d a, __m128d b); 
VUNPCKLPD __m128d _mm_maskz_unpacklo_pd(__mmask8 k, __m128d a, __m128d b); 

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instructions, see Exceptions Type 4. 

EVEX-encoded instructions, see Exceptions Type E4NF. 
UNPCKLPS—Unpack and Interleave Low Packed Single-Precision Floating-Point Values 

Opcode/Instruction                                                         Op/En  64/32 bit Mode Support  CPUID Feature Flag  Description
0F 14 /r UNPCKLPS xmm1, xmm2/m128                                          RM     V/V                     SSE                 Unpacks and Interleaves single-precision floating-point values from low quadwords of xmm1 and xmm2/m128.
VEX.NDS.128.0F.WIG 14 /r VUNPCKLPS xmm1, xmm2, xmm3/m128                   RVM    V/V                     AVX                 Unpacks and Interleaves single-precision floating-point values from low quadwords of xmm2 and xmm3/m128.
VEX.NDS.256.0F.WIG 14 /r VUNPCKLPS ymm1, ymm2, ymm3/m256                   RVM    V/V                     AVX                 Unpacks and Interleaves single-precision floating-point values from low quadwords of ymm2 and ymm3/m256.
EVEX.NDS.128.0F.W0 14 /r VUNPCKLPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst  FV     V/V                     AVX512VL AVX512F    Unpacks and Interleaves single-precision floating-point values from low quadwords of xmm2 and xmm3/mem and write result to xmm1 subject to write mask k1.
EVEX.NDS.256.0F.W0 14 /r VUNPCKLPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst  FV     V/V                     AVX512VL AVX512F    Unpacks and Interleaves single-precision floating-point values from low quadwords of ymm2 and ymm3/mem and write result to ymm1 subject to write mask k1.
EVEX.NDS.512.0F.W0 14 /r VUNPCKLPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst  FV     V/V                     AVX512F             Unpacks and Interleaves single-precision floating-point values from low quadwords of zmm2 and zmm3/m512/m32bcst and write result to zmm1 subject to write mask k1.


Instruction Operand Encoding 


Op/En  Operand 1         Operand 2      Operand 3      Operand 4
RM     ModRM:reg (r, w)  ModRM:r/m (r)  NA             NA
RVM    ModRM:reg (w)     VEX.vvvv (r)   ModRM:r/m (r)  NA
FV     ModRM:reg (w)     EVEX.vvvv (r)  ModRM:r/m (r)  NA


Description 

Performs an interleaved unpack of the low single-precision floating-point values from the first source operand and 
the second source operand. 

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The 
destination is not distinct from the first source XMM register and the upper bits (MAX_VL-1:128) of the 
corresponding ZMM register destination are unmodified. When unpacking from a memory operand, an implementation 
may fetch only the appropriate 64 bits; however, alignment to a 16-byte boundary and normal segment checking 
will still be enforced. 

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM 
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAX_VL-1:128) 
of the corresponding ZMM register destination are zeroed. 

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM 
register or a 256-bit memory location. The destination operand is a YMM register. 

Figure 4-28. VUNPCKLPS Operation 

EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM 
register, a 512-bit memory location, or a 512-bit vector broadcasted from a 32-bit memory location. The 
destination operand is a ZMM register, conditionally updated using writemask k1. 

EVEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM 
register, a 256-bit memory location, or a 256-bit vector broadcasted from a 32-bit memory location. The 
destination operand is a YMM register, conditionally updated using writemask k1. 

EVEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM 
register, a 128-bit memory location, or a 128-bit vector broadcasted from a 32-bit memory location. The 
destination operand is an XMM register, conditionally updated using writemask k1. 

Operation 

VUNPCKLPS (EVEX encoded version when SRC2 is a ZMM register) 

(KL, VL) = (4, 128), (8, 256), (16, 512) 

IF VL >= 128 
    TMP_DEST[31:0] ← SRC1[31:0] 
    TMP_DEST[63:32] ← SRC2[31:0] 
    TMP_DEST[95:64] ← SRC1[63:32] 
    TMP_DEST[127:96] ← SRC2[63:32] 
FI; 

IF VL >= 256 
    TMP_DEST[159:128] ← SRC1[159:128] 
    TMP_DEST[191:160] ← SRC2[159:128] 
    TMP_DEST[223:192] ← SRC1[191:160] 
    TMP_DEST[255:224] ← SRC2[191:160] 
FI; 

IF VL >= 512 
    TMP_DEST[287:256] ← SRC1[287:256] 
    TMP_DEST[319:288] ← SRC2[287:256] 
    TMP_DEST[351:320] ← SRC1[319:288] 
    TMP_DEST[383:352] ← SRC2[319:288] 
    TMP_DEST[415:384] ← SRC1[415:384] 
    TMP_DEST[447:416] ← SRC2[415:384] 
    TMP_DEST[479:448] ← SRC1[447:416] 
    TMP_DEST[511:480] ← SRC2[447:416] 
FI; 

FOR j ← 0 TO KL-1 
    i ← j * 32 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i] 
        ELSE 
            IF *merging-masking*                ; merging-masking 
                THEN *DEST[i+31:i] remains unchanged* 
                ELSE *zeroing-masking*          ; zeroing-masking 
                    DEST[i+31:i] ← 0 
            FI 
    FI; 
ENDFOR 

DEST[MAX_VL-1:VL] ← 0 
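
The register form above can be followed with a small scalar model. The sketch below is illustrative only: the function names unpack_low_ps and write_masked are hypothetical, vectors are modeled as Python lists of 4, 8, or 16 values (VL = 128, 256, or 512), and k1 is an integer whose bit j is the writemask bit for element j. It does not model the zeroing of DEST[MAX_VL-1:VL].

```python
def unpack_low_ps(src1, src2):
    """Interleave elements 0 and 1 of every 128-bit lane (4 floats per lane)."""
    if len(src1) != len(src2) or len(src1) not in (4, 8, 16):
        raise ValueError("expected two vectors of 4, 8, or 16 elements")
    tmp_dest = []
    for base in range(0, len(src1), 4):      # one pass per 128-bit lane
        tmp_dest += [src1[base], src2[base], src1[base + 1], src2[base + 1]]
    return tmp_dest


def write_masked(dest, tmp_dest, k1=None, zeroing=False):
    """Apply the per-element writemask loop; k1=None models *no writemask*."""
    out = []
    for j, (old, new) in enumerate(zip(dest, tmp_dest)):
        if k1 is None or (k1 >> j) & 1:
            out.append(new)                  # mask bit set: take the new element
        elif zeroing:
            out.append(0.0)                  # zeroing-masking
        else:
            out.append(old)                  # merging-masking
    return out
```

For VL = 256 with src1 = [0, 1, ..., 7] and src2 = [10, 11, ..., 17], unpack_low_ps yields elements 0, 10, 1, 11, 4, 14, 5, 15. Elements 2, 3, 6, and 7 of both sources are never read, because only the low half of each 128-bit lane takes part.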

VUNPCKLPS (EVEX encoded version when SRC2 is memory) 

(KL, VL) = (4,128), (8, 256), (16, 512) 

FOR j ← 0 TO KL-1 
    i ← j * 32 
    IF (EVEX.b = 1) 
        THEN TMP_SRC2[i+31:i] ← SRC2[31:0] 
        ELSE TMP_SRC2[i+31:i] ← SRC2[i+31:i] 
    FI; 
ENDFOR; 

IF VL >= 128 
    TMP_DEST[31:0] ← SRC1[31:0] 
    TMP_DEST[63:32] ← TMP_SRC2[31:0] 
    TMP_DEST[95:64] ← SRC1[63:32] 
    TMP_DEST[127:96] ← TMP_SRC2[63:32] 
FI; 

IF VL >= 256 
    TMP_DEST[159:128] ← SRC1[159:128] 
    TMP_DEST[191:160] ← TMP_SRC2[159:128] 
    TMP_DEST[223:192] ← SRC1[191:160] 
    TMP_DEST[255:224] ← TMP_SRC2[191:160] 
FI; 

IF VL >= 512 
    TMP_DEST[287:256] ← SRC1[287:256] 
    TMP_DEST[319:288] ← TMP_SRC2[287:256] 
    TMP_DEST[351:320] ← SRC1[319:288] 
    TMP_DEST[383:352] ← TMP_SRC2[319:288] 
    TMP_DEST[415:384] ← SRC1[415:384] 
    TMP_DEST[447:416] ← TMP_SRC2[415:384] 
    TMP_DEST[479:448] ← SRC1[447:416] 
    TMP_DEST[511:480] ← TMP_SRC2[447:416] 
FI; 


FOR j ← 0 TO KL-1 
    i ← j * 32 
    IF k1[j] OR *no writemask* 
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i] 
        ELSE 
            IF *merging-masking*                ; merging-masking 
                THEN *DEST[i+31:i] remains unchanged* 
                ELSE *zeroing-masking*          ; zeroing-masking 
                    DEST[i+31:i] ← 0 
            FI 
    FI; 
ENDFOR 

DEST[MAX_VL-1:VL] ← 0 
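
The only difference in the memory form is how the second source is assembled before the interleave. The following sketch models that step under stated assumptions: load_src2 and unpack_low_ps_mem are hypothetical names, memory is a Python list standing in for the memory operand, and the writemask step is omitted for brevity.

```python
def load_src2(memory, kl, evex_b):
    """Build TMP_SRC2 from memory: broadcast element 0 when EVEX.b = 1."""
    if evex_b:
        return [memory[0]] * kl              # TMP_SRC2[i+31:i] <- SRC2[31:0]
    return list(memory[:kl])                 # TMP_SRC2[i+31:i] <- SRC2[i+31:i]


def unpack_low_ps_mem(src1, memory, evex_b=False):
    """Unpack-low with a memory second operand (no writemask modeled)."""
    kl = len(src1)                           # 4, 8, or 16 single-precision elements
    tmp_src2 = load_src2(memory, kl, evex_b)
    out = []
    for base in range(0, kl, 4):             # one 128-bit lane at a time
        out += [src1[base], tmp_src2[base], src1[base + 1], tmp_src2[base + 1]]
    return out
```

With broadcast enabled, every odd-numbered result element is the same 32-bit value read from memory, so a single scalar is interleaved with the low elements of each lane of the first source.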

VUNPCKLPS (VEX.256 encoded version) 

DEST[31:0] ← SRC1[31:0] 
DEST[63:32] ← SRC2[31:0] 
DEST[95:64] ← SRC1[63:32] 
DEST[127:96] ← SRC2[63:32] 
DEST[159:128] ← SRC1[159:128] 
DEST[191:160] ← SRC2[159:128] 
DEST[223:192] ← SRC1[191:160] 
DEST[255:224] ← SRC2[191:160] 
DEST[MAX_VL-1:256] ← 0 

VUNPCKLPS (VEX.128 encoded version) 

DEST[31:0] ← SRC1[31:0] 
DEST[63:32] ← SRC2[31:0] 
DEST[95:64] ← SRC1[63:32] 
DEST[127:96] ← SRC2[63:32] 
DEST[MAX_VL-1:128] ← 0 

UNPCKLPS (128-bit Legacy SSE version) 

DEST[31:0] ← SRC1[31:0] 
DEST[63:32] ← SRC2[31:0] 
DEST[95:64] ← SRC1[63:32] 
DEST[127:96] ← SRC2[63:32] 

DEST[MAX_VL-1:128] (Unmodified) 
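
The VEX.128 and legacy SSE forms compute the same low 128 bits and differ only in what happens to the rest of the destination register. The sketch below illustrates that difference. It assumes MAX_VL = 512, so a register is modeled as a list of sixteen 32-bit elements, and the name unpack_low_128 is hypothetical.

```python
MAX_ELEMS = 16  # assumption: MAX_VL = 512 bits, i.e. sixteen 32-bit elements


def unpack_low_128(dest_reg, src1, src2, vex_encoded):
    """Write a 128-bit unpack-low result into a 16-element register model."""
    low = [src1[0], src2[0], src1[1], src2[1]]
    if vex_encoded:
        upper = [0.0] * (MAX_ELEMS - 4)      # DEST[MAX_VL-1:128] <- 0
    else:
        upper = list(dest_reg[4:MAX_ELEMS])  # DEST[MAX_VL-1:128] unmodified
    return low + upper
```

In this model the legacy form carries the previous upper elements through unchanged, while the VEX.128 form clears them.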

Intel C/C++ Compiler Intrinsic Equivalent 

VUNPCKLPS __m512 _mm512_unpacklo_ps(__m512 a, __m512 b); 

VUNPCKLPS __m512 _mm512_mask_unpacklo_ps(__m512 s, __mmask16 k, __m512 a, __m512 b); 

VUNPCKLPS __m512 _mm512_maskz_unpacklo_ps(__mmask16 k, __m512 a, __m512 b); 

VUNPCKLPS __m256 _mm256_unpacklo_ps(__m256 a, __m256 b); 

VUNPCKLPS __m256 _mm256_mask_unpacklo_ps(__m256 s, __mmask8 k, __m256 a, __m256 b); 

VUNPCKLPS __m256 _mm256_maskz_unpacklo_ps(__mmask8 k, __m256 a, __m256 b); 

UNPCKLPS __m128 _mm_unpacklo_ps(__m128 a, __m128 b); 

VUNPCKLPS __m128 _mm_mask_unpacklo_ps(__m128 s, __mmask8 k, __m128 a, __m128 b); 

VUNPCKLPS __m128 _mm_maskz_unpacklo_ps(__mmask8 k, __m128 a, __m128 b); 

SIMD Floating-Point Exceptions 

None 

Other Exceptions 

Non-EVEX-encoded instructions, see Exceptions Type 4. 

EVEX-encoded instructions, see Exceptions Type E4NF. 